Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH v3] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Mike Rapoport @ 2016-05-24 12:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <1462356950-3278-1-git-send-email-rppt@linux.vnet.ibm.com>

Michael,

Any updates on this?

On Wed, May 04, 2016 at 01:15:50PM +0300, Mike Rapoport wrote:
> /dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
> Using lscpu to get processors count is more portable.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> ---
> v3: simplify by using lscpu -p=cpu 
> v2: use lspcu instead of /proc/cpuinfo as per Cornelia's suggestion
> 
>  tools/virtio/ringtest/run-on-all.sh | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
> index 52b0f71..2e69ca8 100755
> --- a/tools/virtio/ringtest/run-on-all.sh
> +++ b/tools/virtio/ringtest/run-on-all.sh
> @@ -3,10 +3,10 @@
>  #use last CPU for host. Why not the first?
>  #many devices tend to use cpu0 by default so
>  #it tends to be busier
> -HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
> +HOST_AFFINITY=$(lscpu -p=cpu | tail -1)
> 
>  #run command on all cpus
> -for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
> +for cpu in $(seq 0 $HOST_AFFINITY)
>  do
>  	#Don't run guest and host on same CPU
>  	#It actually works ok if using signalling
> -- 
> 1.9.1
> 

^ permalink raw reply

* Re: [PATCH] tools/virtio/ringtest: add usage example to README
From: Mike Rapoport @ 2016-05-24 12:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <1462342376-16065-1-git-send-email-rppt@linux.vnet.ibm.com>

Michael,

Any updates on this?

On Wed, May 04, 2016 at 09:12:55AM +0300, Mike Rapoport wrote:
> Having typical usage example in the README file is more convinient than in
> the git history...
> 
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> ---
>  tools/virtio/ringtest/README | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tools/virtio/ringtest/README b/tools/virtio/ringtest/README
> index 34e94c4..d83707a 100644
> --- a/tools/virtio/ringtest/README
> +++ b/tools/virtio/ringtest/README
> @@ -1,2 +1,6 @@
>  Partial implementation of various ring layouts, useful to tune virtio design.
>  Uses shared memory heavily.
> +
> +Typical use:
> +
> +# sh run-on-all.sh perf stat -r 10 --log-fd 1 -- ./ring
> -- 
> 1.9.1
> 

^ permalink raw reply

* [PATCH] tools/virtio: add noring tool
From: Michael S. Tsirkin @ 2016-05-24 12:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization

Useful to measure testing framework overhead.

Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/noring.c | 69 ++++++++++++++++++++++++++++++++++++++++++
 tools/virtio/ringtest/Makefile |  4 ++-
 2 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 tools/virtio/ringtest/noring.c

diff --git a/tools/virtio/ringtest/noring.c b/tools/virtio/ringtest/noring.c
new file mode 100644
index 0000000..eda2f48
--- /dev/null
+++ b/tools/virtio/ringtest/noring.c
@@ -0,0 +1,69 @@
+#define _GNU_SOURCE
+#include "main.h"
+#include <assert.h>
+
+/* stub implementation: useful for measuring overhead */
+void alloc_ring(void)
+{
+}
+
+/* guest side */
+int add_inbuf(unsigned len, void *buf, void *datap)
+{
+	return 0;
+}
+
+/*
+ * skb_array API provides no way for producer to find out whether a given
+ * buffer was consumed.  Our tests merely require that a successful get_buf
+ * implies that add_inbuf succeed in the past, and that add_inbuf will succeed,
+ * fake it accordingly.
+ */
+void *get_buf(unsigned *lenp, void **bufp)
+{
+	return "Buffer";
+}
+
+void poll_used(void)
+{
+}
+
+void disable_call()
+{
+	assert(0);
+}
+
+bool enable_call()
+{
+	assert(0);
+}
+
+void kick_available(void)
+{
+	assert(0);
+}
+
+/* host side */
+void disable_kick()
+{
+	assert(0);
+}
+
+bool enable_kick()
+{
+	assert(0);
+}
+
+void poll_avail(void)
+{
+}
+
+bool use_buf(unsigned *lenp, void **bufp)
+{
+	return true;
+}
+
+void call_used(void)
+{
+	assert(0);
+}
diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
index a8356d8..b3d5bc8 100644
--- a/tools/virtio/ringtest/Makefile
+++ b/tools/virtio/ringtest/Makefile
@@ -1,6 +1,6 @@
 all:
 
-all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
+all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array noring
 
 CFLAGS += -Wall
 CFLAGS += -pthread -O2 -ggdb
@@ -17,6 +17,7 @@ virtio_ring_0_9: virtio_ring_0_9.o main.o
 virtio_ring_poll: virtio_ring_poll.o main.o
 virtio_ring_inorder: virtio_ring_inorder.o main.o
 skb_array: skb_array.o main.o
+noring: noring.o main.o
 clean:
 	-rm main.o
 	-rm ring.o ring
@@ -24,5 +25,6 @@ clean:
 	-rm virtio_ring_poll.o virtio_ring_poll
 	-rm virtio_ring_inorder.o virtio_ring_inorder
 	-rm skb_array.o skb_array
+	-rm noring.o noring
 
 .PHONY: all clean
-- 
MST

^ permalink raw reply related

* [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2016-05-24 11:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kvm, mst, netdev, linux-kernel, virtualization

The following changes since commit 2dcd0af568b0cf583645c8a317dd12e344b1c72a:

  Linux 4.6 (2016-05-15 15:43:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to bb991288728e6a47a6f0fac6a4e9dfaeecc27956:

  ringtest: pass buf != NULL (2016-05-22 19:44:14 +0300)

----------------------------------------------------------------
virtio: patches for 4.7

Looks like a quiet cycle for virtio.  There's a new inorder option for the
ringtest tool, and a bugfix for balloon for ppc platforms when using virtio 1
mode.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (3):
      virtio: add inorder option
      virtio_balloon: fix PFN format for virtio-1
      ringtest: pass buf != NULL

 drivers/virtio/virtio_balloon.c             | 20 +++++++-----
 tools/virtio/ringtest/main.c                |  2 +-
 tools/virtio/ringtest/virtio_ring_0_9.c     | 49 ++++++++++++++++++++++++++++-
 tools/virtio/ringtest/virtio_ring_inorder.c |  2 ++
 tools/virtio/ringtest/Makefile              |  5 ++-
 5 files changed, 67 insertions(+), 11 deletions(-)
 create mode 100644 tools/virtio/ringtest/virtio_ring_inorder.c

^ permalink raw reply

* Re: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Michael S. Tsirkin @ 2016-05-24 11:11 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E041A4D21@shsmsx102.ccr.corp.intel.com>

On Tue, May 24, 2016 at 10:38:43AM +0000, Li, Liang Z wrote:
> > > > >  {
> > > > > -	struct scatterlist sg;
> > > > >  	unsigned int len;
> > > > >
> > > > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > > > +	if (virtio_has_feature(vb->vdev,
> > > > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > > > +		u32 page_shift = PAGE_SHIFT;
> > > > > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > > > +		struct scatterlist sg[5];
> > > > > +
> > > > > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > > > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > > > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> > > > sizeof(long);
> > > > > +
> > > > > +		sg_init_table(sg, 5);
> > > > > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > > > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > > > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > > > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > > > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > > > > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> > > >
> > > > This can be pre-initialized, correct?
> > >
> > > pre-initialized? I am not quite understand your mean.
> > 
> > I think you can maintain sg as part of device state and init sg with the bitmap.
> > 
> 
> I got it.
> 
> > > > This is grossly inefficient if you only requested a single page.
> > > > And it's also allocating memory very aggressively without ever
> > > > telling the host what is going on.
> > >
> > > If only requested a single page, there is no need  to send the entire
> > > page bitmap, This RFC patch has already considered about this.
> > 
> > where's that addressed in code?
> > 
> 
> By record the start_pfn and end_pfn.
> 
> The start_pfn & end_pfn will be updated in set_page_bitmap()
> and will be used in the function tell_host():
> 
> ---------------------------------------------------------------------------------
> +static void set_page_bitmap(struct virtio_balloon *vb, struct page 
> +*page) {
> +	unsigned int i;
> +	unsigned long *bitmap = vb->page_bitmap;
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> +		set_bit(balloon_pfn + i, bitmap);

BTW, there's a page size value in header so there
is no longer need to set multiple bits per page.

> +	if (balloon_pfn < vb->start_pfn)
> +		vb->start_pfn = balloon_pfn;
> +	if (balloon_pfn > vb->end_pfn)
> +		vb->end_pfn = balloon_pfn;
> +}

Sounds good, but you also need to limit by
allocated bitmap size.

> 
> +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> +		struct scatterlist sg[5];
> +
> +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> +
> +		sg_init_table(sg, 5);
> +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> +		sg_set_buf(&sg[4], vb->page_bitmap +
> +				 (start_pfn / BITS_PER_LONG), bmap_len);

Looks wrong. start_pfn should start at offset 0 I think ...

> +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> -------------------------------------------------------------------------------------------
> > > But it can works very well if requesting several pages  which across a
> > > large range.
> > 
> > Some kind of limit on range would make sense though.
> > It need not cover max pfn.
> > 
> 
> Yes, agree.
> 
> > > > Suggestion to address all above comments:
> > > > 	1. allocate a bunch of pages and link them up,
> > > > 	   calculating the min and the max pfn.
> > > > 	   if max-min exceeds the allocated bitmap size,
> > > > 	   tell host.
> > >
> > > I am not sure if it works well in some cases, e.g. The allocated pages
> > > are across a wide range and the max-min > limit is very frequently to be
> > true.
> > > Then, there will be many times of virtio transmission and it's bad for
> > > performance improvement. Right?
> > 
> > It's a tradeoff for sure. Measure it, see what the overhead is.
> 
> OK, I will try and get back to you.
> 
> > 
> > >
> > > > 	2. limit allocated bitmap size to something reasonable.
> > > > 	   How about 32Kbytes? This is 256kilo bit in the map, which comes
> > > > 	   out to 1Giga bytes of memory in the balloon.
> > >
> > > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> > memory.
> > > Maybe it's better to use a big page bitmap the save the pages
> > > allocated by balloon, and split the big page bitmap to 32K bytes unit, then
> > transfer one unit at a time.
> > 
> > How is this different from what I said?
> > 
> 
> It's good if it's the same as you said.
> 
> Thanks!
> Liang
> 
> > >
> > > Should we use a page bitmap to replace 'vb->pages' ?
> > >
> > > How about rolling back to use PFNs if the count of requested pages is a
> > small number?
> > >
> > > Liang
> > 
> > That's why we have start pfn. you can use that to pass even a single page
> > without a lot of overhead.
> > 
> > > > > --
> > > > > 1.9.1
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > > the body of a message to majordomo@vger.kernel.org More majordomo
> > > > info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Li, Liang Z @ 2016-05-24 10:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <20160524130041-mutt-send-email-mst@redhat.com>

> > > >  {
> > > > -	struct scatterlist sg;
> > > >  	unsigned int len;
> > > >
> > > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > > +	if (virtio_has_feature(vb->vdev,
> > > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > > +		u32 page_shift = PAGE_SHIFT;
> > > > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > > +		struct scatterlist sg[5];
> > > > +
> > > > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> > > sizeof(long);
> > > > +
> > > > +		sg_init_table(sg, 5);
> > > > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > > > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> > >
> > > This can be pre-initialized, correct?
> >
> > pre-initialized? I am not quite understand your mean.
> 
> I think you can maintain sg as part of device state and init sg with the bitmap.
> 

I got it.

> > > This is grossly inefficient if you only requested a single page.
> > > And it's also allocating memory very aggressively without ever
> > > telling the host what is going on.
> >
> > If only requested a single page, there is no need  to send the entire
> > page bitmap, This RFC patch has already considered about this.
> 
> where's that addressed in code?
> 

By record the start_pfn and end_pfn.

The start_pfn & end_pfn will be updated in set_page_bitmap()
and will be used in the function tell_host():

---------------------------------------------------------------------------------
+static void set_page_bitmap(struct virtio_balloon *vb, struct page 
+*page) {
+	unsigned int i;
+	unsigned long *bitmap = vb->page_bitmap;
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
+		set_bit(balloon_pfn + i, bitmap);
+	if (balloon_pfn < vb->start_pfn)
+		vb->start_pfn = balloon_pfn;
+	if (balloon_pfn > vb->end_pfn)
+		vb->end_pfn = balloon_pfn;
+}


+		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
+		struct scatterlist sg[5];
+
+		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
+		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
+		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
+
+		sg_init_table(sg, 5);
+		sg_set_buf(&sg[0], &flags, sizeof(flags));
+		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
+		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
+		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
+		sg_set_buf(&sg[4], vb->page_bitmap +
+				 (start_pfn / BITS_PER_LONG), bmap_len);
+		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
-------------------------------------------------------------------------------------------
> > But it can works very well if requesting several pages  which across a
> > large range.
> 
> Some kind of limit on range would make sense though.
> It need not cover max pfn.
> 

Yes, agree.

> > > Suggestion to address all above comments:
> > > 	1. allocate a bunch of pages and link them up,
> > > 	   calculating the min and the max pfn.
> > > 	   if max-min exceeds the allocated bitmap size,
> > > 	   tell host.
> >
> > I am not sure if it works well in some cases, e.g. The allocated pages
> > are across a wide range and the max-min > limit is very frequently to be
> true.
> > Then, there will be many times of virtio transmission and it's bad for
> > performance improvement. Right?
> 
> It's a tradeoff for sure. Measure it, see what the overhead is.

OK, I will try and get back to you.

> 
> >
> > > 	2. limit allocated bitmap size to something reasonable.
> > > 	   How about 32Kbytes? This is 256kilo bit in the map, which comes
> > > 	   out to 1Giga bytes of memory in the balloon.
> >
> > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> memory.
> > Maybe it's better to use a big page bitmap the save the pages
> > allocated by balloon, and split the big page bitmap to 32K bytes unit, then
> transfer one unit at a time.
> 
> How is this different from what I said?
> 

It's good if it's the same as you said.

Thanks!
Liang

> >
> > Should we use a page bitmap to replace 'vb->pages' ?
> >
> > How about rolling back to use PFNs if the count of requested pages is a
> small number?
> >
> > Liang
> 
> That's why we have start pfn. you can use that to pass even a single page
> without a lot of overhead.
> 
> > > > --
> > > > 1.9.1
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org More majordomo
> > > info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Michael S. Tsirkin @ 2016-05-24 10:08 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E041A4BFD@shsmsx102.ccr.corp.intel.com>

On Tue, May 24, 2016 at 09:51:46AM +0000, Li, Liang Z wrote:
> > On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient, Bellow is test result of time spends on inflating the
> > > balloon to 3GB of a 4GB idle guest:
> > >
> > > a. allocating pages (6.5%, 103ms)
> > > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > > 96ms) d. madvise (19%, 300ms)
> > >
> > > It takes about 1577ms for the whole inflating process to complete. The
> > > test shows that the bottle neck is the stage b and stage d.
> > >
> > > If using a bitmap to send the page info instead of the PFNs, we can
> > > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > > possible to do the address translation and do the madvise with a bulk
> > > of pages, instead of the current page per page way, so the overhead of
> > > stage c and stage d can also be reduced a lot.
> > >
> > > This patch is the kernel side implementation which is intended to
> > > speed up the inflating & deflating process by adding a new feature to
> > > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > > 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> > >
> > > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > > of a single page at a time.
> > >
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >  drivers/virtio/virtio_balloon.c     | 199
> > ++++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/virtio_balloon.h |   1 +
> > >  mm/page_alloc.c                     |   6 ++
> > >  3 files changed, 198 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index 7b6d74f..5330b6f 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -45,6 +45,8 @@ static int oom_pages =
> > OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > +extern unsigned long get_max_pfn(void);
> > > +
> > >  struct virtio_balloon {
> > >  	struct virtio_device *vdev;
> > >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -62,6 +64,9
> > > @@ struct virtio_balloon {
> > >
> > >  	/* Number of balloon pages we've told the Host we're not using. */
> > >  	unsigned int num_pages;
> > > +	unsigned long *page_bitmap;
> > > +	unsigned long start_pfn, end_pfn;
> > > +	unsigned long bmap_len;
> > >  	/*
> > >  	 * The pages we've told the Host we're not using are enqueued
> > >  	 * at vb_dev_info->pages list.
> > > @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
> > >  	wake_up(&vb->acked);
> > >  }
> > >
> > > +static int balloon_page_bitmap_init(struct virtio_balloon *vb) {
> > > +	unsigned long max_pfn, bmap_bytes;
> > > +
> > > +	max_pfn = get_max_pfn();
> > 
> > This is racy. max_pfn could be increased by memory hotplug after you got it.
> > 
> > 
> > > +	bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> > > +	if (!vb->page_bitmap)
> > > +		vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> > 
> > Likely to fail for a huge busy guest.
> > Why not init on device probe?
> > this way
> > 	- probe will fail, or we can clear the feature bit
> > 	- free memory is more likely to be available
> > 
> 
> Very good suggestion!
> 
> > 
> > > +	else {
> > > +		if (bmap_bytes <= vb->bmap_len)
> > > +			memset(vb->page_bitmap, 0, bmap_bytes);
> > > +		else {
> > > +			kfree(vb->page_bitmap);
> > > +			vb->page_bitmap = kzalloc(bmap_bytes,
> > GFP_KERNEL);
> > > +		}
> > > +	}
> > > +	if (!vb->page_bitmap) {
> > > +		dev_err(&vb->vdev->dev, "%s failure: allocate page
> > bitmap\n",
> > > +			 __func__);
> > > +		return -ENOMEM;
> > > +	}
> > > +	vb->bmap_len = bmap_bytes;
> > > +	vb->start_pfn = max_pfn;
> > > +	vb->end_pfn = 0;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > 
> > >  {
> > > -	struct scatterlist sg;
> > >  	unsigned int len;
> > >
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > +		u32 page_shift = PAGE_SHIFT;
> > > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > +		struct scatterlist sg[5];
> > > +
> > > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> > sizeof(long);
> > > +
> > > +		sg_init_table(sg, 5);
> > > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> > 
> > This can be pre-initialized, correct?
> 
> pre-initialized? I am not quite understand your mean.

I think you can maintain sg as part of device state
and init sg with the bitmap.

> > 
> > > +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> > > +
> > > +	} else {
> > > +		struct scatterlist sg;
> > > +
> > > +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb-
> > >num_pfns);
> > > +		/* We should always be able to add one buffer to an
> > > +		* empty queue.
> > > +		*/
> > > +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	}
> > >
> > > -	/* We should always be able to add one buffer to an empty queue.
> > */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > >  	virtqueue_kick(vq);
> > >
> > >  	/* When host has read buffer, this completes via balloon_ack */ @@
> > > -137,7 +193,21 @@ static void set_page_pfns(u32 pfns[], struct page *page)
> > >  		pfns[i] = page_to_balloon_pfn(page) + i;  }
> > >
> > > -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> > > +static void set_page_bitmap(struct virtio_balloon *vb, struct page
> > > +*page) {
> > > +	unsigned int i;
> > > +	unsigned long *bitmap = vb->page_bitmap;
> > > +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > > +
> > > +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> > > +		set_bit(balloon_pfn + i, bitmap);
> > > +	if (balloon_pfn < vb->start_pfn)
> > > +		vb->start_pfn = balloon_pfn;
> > > +	if (balloon_pfn > vb->end_pfn)
> > > +		vb->end_pfn = balloon_pfn;
> > > +}
> > > +
> > > +static unsigned fill_balloon_pfns(struct virtio_balloon *vb, size_t
> > > +num)
> > >  {
> > >  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > >  	unsigned num_allocated_pages;
> > > @@ -174,7 +244,104 @@ static unsigned fill_balloon(struct virtio_balloon
> > *vb, size_t num)
> > >  	return num_allocated_pages;
> > >  }
> > >
> > > -static void release_pages_balloon(struct virtio_balloon *vb)
> > > +static long fill_balloon_bitmap(struct virtio_balloon *vb, size_t
> > > +num) {
> > > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > > +	long num_allocated_pages = 0;
> > > +
> > > +	if (balloon_page_bitmap_init(vb) < 0)
> > > +		return num;
> > > +
> > > +	mutex_lock(&vb->balloon_lock);
> > > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		struct page *page = balloon_page_enqueue(vb_dev_info);
> > > +
> > > +		if (!page) {
> > > +			dev_info_ratelimited(&vb->vdev->dev,
> > > +					     "Out of puff! Can't get %u
> > pages\n",
> > > +
> > VIRTIO_BALLOON_PAGES_PER_PAGE);
> > > +			/* Sleep for at least 1/5 of a second before retry. */
> > > +			msleep(200);
> > > +			break;
> > > +		}
> > > +		set_page_bitmap(vb, page);
> > > +		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > +		if (!virtio_has_feature(vb->vdev,
> > > +
> > 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > +			adjust_managed_page_count(page, -1);
> > > +	}
> > 
> > This is grossly inefficient if you only requested a single page.
> > And it's also allocating memory very aggressively without ever telling the host
> > what is going on.
> 
> If only requested a single page, there is no need  to send the entire page bitmap,
> This RFC patch has already considered about this.

where's that addressed in code?

> But it can works very well if requesting
> several pages  which across a large range.

Some kind of limit on range would make sense though.
It need not cover max pfn.

> > > +
> > > +	num_allocated_pages = vb->num_pfns;
> > > +	/* Did we get any? */
> > > +	if (vb->num_pfns != 0)
> > > +		tell_host(vb, vb->inflate_vq);
> > > +	mutex_unlock(&vb->balloon_lock);
> > > +
> > > +	return num_allocated_pages;
> > > +}
> > > +
> > > +static long fill_balloon(struct virtio_balloon *vb, size_t num) {
> > > +	long num_allocated_pages;
> > > +
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP))
> > > +		num_allocated_pages = fill_balloon_bitmap(vb, num);
> > > +	else
> > > +		num_allocated_pages = fill_balloon_pfns(vb, num);
> > > +
> > > +	return num_allocated_pages;
> > > +}
> > > +
> > > +static void release_pages_balloon_bitmap(struct virtio_balloon *vb) {
> > > +	unsigned long pfn, offset, size;
> > > +	struct page *page;
> > > +
> > > +	size = min(vb->bmap_len * BITS_PER_BYTE, vb->end_pfn);
> > > +	for (offset = vb->start_pfn; offset < size;
> > > +		 offset = pfn + VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		pfn = find_next_bit(vb->page_bitmap, size, offset);
> > > +		if (pfn < size) {
> > > +			page = balloon_pfn_to_page(pfn);
> > > +			if (!virtio_has_feature(vb->vdev,
> > > +
> > 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > +				adjust_managed_page_count(page, 1);
> > > +			put_page(page);
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static unsigned long leak_balloon_bitmap(struct virtio_balloon *vb,
> > > +size_t num) {
> > > +	unsigned long num_freed_pages = num;
> > > +	struct page *page;
> > > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > > +
> > > +	if (balloon_page_bitmap_init(vb) < 0)
> > > +		return num_freed_pages;
> > > +
> > > +	mutex_lock(&vb->balloon_lock);
> > > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		page = balloon_page_dequeue(vb_dev_info);
> > > +		if (!page)
> > > +			break;
> > > +		set_page_bitmap(vb, page);
> > > +		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > +	}
> > > +
> > > +	num_freed_pages = vb->num_pfns;
> > > +
> > > +	if (vb->num_pfns != 0)
> > > +		tell_host(vb, vb->deflate_vq);
> > > +	release_pages_balloon_bitmap(vb);
> > > +	mutex_unlock(&vb->balloon_lock);
> > > +
> > > +	return num_freed_pages;
> > > +}
> > > +
> > > +static void release_pages_balloon_pfns(struct virtio_balloon *vb)
> > >  {
> > >  	unsigned int i;
> > >
> > > @@ -188,7 +355,7 @@ static void release_pages_balloon(struct
> > virtio_balloon *vb)
> > >  	}
> > >  }
> > >
> > > -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> > > +static unsigned leak_balloon_pfns(struct virtio_balloon *vb, size_t
> > > +num)
> > >  {
> > >  	unsigned num_freed_pages;
> > >  	struct page *page;
> > > @@ -215,11 +382,23 @@ static unsigned leak_balloon(struct virtio_balloon
> > *vb, size_t num)
> > >  	 */
> > >  	if (vb->num_pfns != 0)
> > >  		tell_host(vb, vb->deflate_vq);
> > > -	release_pages_balloon(vb);
> > > +	release_pages_balloon_pfns(vb);
> > >  	mutex_unlock(&vb->balloon_lock);
> > >  	return num_freed_pages;
> > >  }
> > >
> > > +static long leak_balloon(struct virtio_balloon *vb, size_t num) {
> > > +	long num_freed_pages;
> > > +
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP))
> > > +		num_freed_pages = leak_balloon_bitmap(vb, num);
> > > +	else
> > > +		num_freed_pages = leak_balloon_pfns(vb, num);
> > > +
> > > +	return num_freed_pages;
> > > +}
> > > +
> > >  static inline void update_stat(struct virtio_balloon *vb, int idx,
> > >  			       u16 tag, u64 val)
> > >  {
> > > @@ -510,6 +689,8 @@ static int virtballoon_probe(struct virtio_device
> > *vdev)
> > >  	spin_lock_init(&vb->stop_update_lock);
> > >  	vb->stop_update = false;
> > >  	vb->num_pages = 0;
> > > +	vb->page_bitmap = NULL;
> > > +	vb->bmap_len = 0;
> > >  	mutex_init(&vb->balloon_lock);
> > >  	init_waitqueue_head(&vb->acked);
> > >  	vb->vdev = vdev;
> > > @@ -567,6 +748,7 @@ static void virtballoon_remove(struct virtio_device
> > *vdev)
> > >  	cancel_work_sync(&vb->update_balloon_stats_work);
> > >
> > >  	remove_common(vb);
> > > +	kfree(vb->page_bitmap);
> > >  	kfree(vb);
> > >  }
> > >
> > > @@ -605,6 +787,7 @@ static unsigned int features[] = {
> > >  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
> > >  	VIRTIO_BALLOON_F_STATS_VQ,
> > >  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > +	VIRTIO_BALLOON_F_PAGE_BITMAP,
> > >  };
> > >
> > >  static struct virtio_driver virtio_balloon_driver = { diff --git
> > > a/include/uapi/linux/virtio_balloon.h
> > > b/include/uapi/linux/virtio_balloon.h
> > > index 343d7dd..f78fa47 100644
> > > --- a/include/uapi/linux/virtio_balloon.h
> > > +++ b/include/uapi/linux/virtio_balloon.h
> > > @@ -34,6 +34,7 @@
> > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before
> > reclaiming pages */
> > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue
> > */
> > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon
> > on OOM */
> > > +#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info
> > with bitmap */
> > >
> > >  /* Size of a PFN in the balloon interface. */  #define
> > > VIRTIO_BALLOON_PFN_SHIFT 12 diff --git a/mm/page_alloc.c
> > > b/mm/page_alloc.c index c1069ef..74b2fc5 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2139,6 +2139,12 @@ void drain_all_pages(struct zone *zone)
> > >  								zone, 1);
> > >  }
> > >
> > > +unsigned long get_max_pfn(void)
> > > +{
> > > +	return max_pfn;
> > > +}
> > > +EXPORT_SYMBOL(get_max_pfn);
> > > +
> > >  #ifdef CONFIG_HIBERNATION
> > >
> > >  void mark_free_pages(struct zone *zone)
> > 
> > Suggestion to address all above comments:
> > 	1. allocate a bunch of pages and link them up,
> > 	   calculating the min and the max pfn.
> > 	   if max-min exceeds the allocated bitmap size,
> > 	   tell host.
> 
> I am not sure if it works well in some cases, e.g. The allocated pages 
> are across a wide range and the max-min > limit is very frequently to be true.
> Then, there will be many times of virtio transmission and it's bad for performance
> improvement. Right?

It's a tradeoff for sure. Measure it, see what the overhead is.

> 
> > 	2. limit allocated bitmap size to something reasonable.
> > 	   How about 32Kbytes? This is 256kilo bit in the map, which comes
> > 	   out to 1Giga bytes of memory in the balloon.
> 
> So, even the VM has 1TB of RAM, the page bitmap will take 32MB of memory.
> Maybe it's better to use a big page bitmap the save the pages allocated by balloon,
> and split the big page bitmap to 32K bytes unit, then transfer one unit at a time.

How is this different from what I said?

> 
> Should we use a page bitmap to replace 'vb->pages' ?
> 
> How about rolling back to use PFNs if the count of requested pages is a small number?
> 
> Liang

That's why we have start pfn. you can use that to pass
even a single page without a lot of overhead.

> > > --
> > > 1.9.1
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> > a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Li, Liang Z @ 2016-05-24  9:55 UTC (permalink / raw)
  To: Li, Liang Z, Michael S. Tsirkin
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E041A4BFD@shsmsx102.ccr.corp.intel.com>

> > On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient, Bellow is test result of time spends on inflating the
> > > balloon to 3GB of a 4GB idle guest:
> > >
> > > a. allocating pages (6.5%, 103ms)
> > > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > > 96ms) d. madvise (19%, 300ms)
> > >
> > > It takes about 1577ms for the whole inflating process to complete.
> > > The test shows that the bottle neck is the stage b and stage d.
> > >
> > > If using a bitmap to send the page info instead of the PFNs, we can
> > > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > > possible to do the address translation and do the madvise with a
> > > bulk of pages, instead of the current page per page way, so the
> > > overhead of stage c and stage d can also be reduced a lot.
> > >
> > > This patch is the kernel side implementation which is intended to
> > > speed up the inflating & deflating process by adding a new feature
> > > to the virtio-balloon device. And now, inflating the balloon to 3GB
> > > of a 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> > >
> > > TODO: optimize stage a by allocating/freeing a chunk of pages
> > > instead of a single page at a time.
> > >
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >  drivers/virtio/virtio_balloon.c     | 199
> > ++++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/virtio_balloon.h |   1 +
> > >  mm/page_alloc.c                     |   6 ++
> > >  3 files changed, 198 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index 7b6d74f..5330b6f 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -45,6 +45,8 @@ static int oom_pages =
> > OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > +extern unsigned long get_max_pfn(void);
> > > +
> > >  struct virtio_balloon {
> > >  	struct virtio_device *vdev;
> > >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -62,6
> > > +64,9 @@ struct virtio_balloon {
> > >
> > >  	/* Number of balloon pages we've told the Host we're not using. */
> > >  	unsigned int num_pages;
> > > +	unsigned long *page_bitmap;
> > > +	unsigned long start_pfn, end_pfn;
> > > +	unsigned long bmap_len;
> > >  	/*
> > >  	 * The pages we've told the Host we're not using are enqueued
> > >  	 * at vb_dev_info->pages list.
> > > @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
> > >  	wake_up(&vb->acked);
> > >  }
> > >
> > > +static int balloon_page_bitmap_init(struct virtio_balloon *vb) {
> > > +	unsigned long max_pfn, bmap_bytes;
> > > +
> > > +	max_pfn = get_max_pfn();
> >
> > This is racy. max_pfn could be increased by memory hotplug after you got it.
> >
> >
> > > +	bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> > > +	if (!vb->page_bitmap)
> > > +		vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> >
> > Likely to fail for a huge busy guest.
> > Why not init on device probe?
> > this way
> > 	- probe will fail, or we can clear the feature bit
> > 	- free memory is more likely to be available
> >
> 
> Very good suggestion!
> 
> >
> > > +	else {
> > > +		if (bmap_bytes <= vb->bmap_len)
> > > +			memset(vb->page_bitmap, 0, bmap_bytes);
> > > +		else {
> > > +			kfree(vb->page_bitmap);
> > > +			vb->page_bitmap = kzalloc(bmap_bytes,
> > GFP_KERNEL);
> > > +		}
> > > +	}
> > > +	if (!vb->page_bitmap) {
> > > +		dev_err(&vb->vdev->dev, "%s failure: allocate page
> > bitmap\n",
> > > +			 __func__);
> > > +		return -ENOMEM;
> > > +	}
> > > +	vb->bmap_len = bmap_bytes;
> > > +	vb->start_pfn = max_pfn;
> > > +	vb->end_pfn = 0;
> > > +
> > > +	return 0;
> > > +}
> > > +
> >
> > >  {
> > > -	struct scatterlist sg;
> > >  	unsigned int len;
> > >
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > +		u32 page_shift = PAGE_SHIFT;
> > > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > +		struct scatterlist sg[5];
> > > +
> > > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> > sizeof(long);
> > > +
> > > +		sg_init_table(sg, 5);
> > > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> >
> > This can be pre-initialized, correct?
> 
> pre-initialized? I am not quite understand your mean.
> 
> >
> > > +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> > > +
> > > +	} else {
> > > +		struct scatterlist sg;
> > > +
> > > +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb-
> > >num_pfns);
> > > +		/* We should always be able to add one buffer to an
> > > +		* empty queue.
> > > +		*/
> > > +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	}
> > >
> > > -	/* We should always be able to add one buffer to an empty queue.
> > */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > >  	virtqueue_kick(vq);
> > >
> > >  	/* When host has read buffer, this completes via balloon_ack */ @@
> > > -137,7 +193,21 @@ static void set_page_pfns(u32 pfns[], struct page
> *page)
> > >  		pfns[i] = page_to_balloon_pfn(page) + i;  }
> > >
> > > -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> > > +static void set_page_bitmap(struct virtio_balloon *vb, struct page
> > > +*page) {
> > > +	unsigned int i;
> > > +	unsigned long *bitmap = vb->page_bitmap;
> > > +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > > +
> > > +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> > > +		set_bit(balloon_pfn + i, bitmap);
> > > +	if (balloon_pfn < vb->start_pfn)
> > > +		vb->start_pfn = balloon_pfn;
> > > +	if (balloon_pfn > vb->end_pfn)
> > > +		vb->end_pfn = balloon_pfn;
> > > +}
> > > +
> > > +static unsigned fill_balloon_pfns(struct virtio_balloon *vb, size_t
> > > +num)
> > >  {
> > >  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > >  	unsigned num_allocated_pages;
> > > @@ -174,7 +244,104 @@ static unsigned fill_balloon(struct
> > > virtio_balloon
> > *vb, size_t num)
> > >  	return num_allocated_pages;
> > >  }
> > >
> > > -static void release_pages_balloon(struct virtio_balloon *vb)
> > > +static long fill_balloon_bitmap(struct virtio_balloon *vb, size_t
> > > +num) {
> > > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > > +	long num_allocated_pages = 0;
> > > +
> > > +	if (balloon_page_bitmap_init(vb) < 0)
> > > +		return num;
> > > +
> > > +	mutex_lock(&vb->balloon_lock);
> > > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		struct page *page = balloon_page_enqueue(vb_dev_info);
> > > +
> > > +		if (!page) {
> > > +			dev_info_ratelimited(&vb->vdev->dev,
> > > +					     "Out of puff! Can't get %u
> > pages\n",
> > > +
> > VIRTIO_BALLOON_PAGES_PER_PAGE);
> > > +			/* Sleep for at least 1/5 of a second before retry. */
> > > +			msleep(200);
> > > +			break;
> > > +		}
> > > +		set_page_bitmap(vb, page);
> > > +		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > +		if (!virtio_has_feature(vb->vdev,
> > > +
> > 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > +			adjust_managed_page_count(page, -1);
> > > +	}
> >
> > This is grossly inefficient if you only requested a single page.
> > And it's also allocating memory very aggressively without ever telling
> > the host what is going on.
> 
> If only requested a single page, there is no need  to send the entire page
> bitmap, This RFC patch has already considered about this. But it can works

s/can/can't
> very well if requesting several pages  which across a large range.
> 
> > > +
> > > +	num_allocated_pages = vb->num_pfns;
> > > +	/* Did we get any? */
> > > +	if (vb->num_pfns != 0)
> > > +		tell_host(vb, vb->inflate_vq);
> > > +	mutex_unlock(&vb->balloon_lock);
> > > +
> > > +	return num_allocated_pages;
> > > +}
> > > +
> > > +static long fill_balloon(struct virtio_balloon *vb, size_t num) {
> > > +	long num_allocated_pages;
> > > +
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP))
> > > +		num_allocated_pages = fill_balloon_bitmap(vb, num);
> > > +	else
> > > +		num_allocated_pages = fill_balloon_pfns(vb, num);
> > > +
> > > +	return num_allocated_pages;
> > > +}
> > > +
> > > +static void release_pages_balloon_bitmap(struct virtio_balloon *vb) {
> > > +	unsigned long pfn, offset, size;
> > > +	struct page *page;
> > > +
> > > +	size = min(vb->bmap_len * BITS_PER_BYTE, vb->end_pfn);
> > > +	for (offset = vb->start_pfn; offset < size;
> > > +		 offset = pfn + VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		pfn = find_next_bit(vb->page_bitmap, size, offset);
> > > +		if (pfn < size) {
> > > +			page = balloon_pfn_to_page(pfn);
> > > +			if (!virtio_has_feature(vb->vdev,
> > > +
> > 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > +				adjust_managed_page_count(page, 1);
> > > +			put_page(page);
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static unsigned long leak_balloon_bitmap(struct virtio_balloon *vb,
> > > +size_t num) {
> > > +	unsigned long num_freed_pages = num;
> > > +	struct page *page;
> > > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > > +
> > > +	if (balloon_page_bitmap_init(vb) < 0)
> > > +		return num_freed_pages;
> > > +
> > > +	mutex_lock(&vb->balloon_lock);
> > > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +		page = balloon_page_dequeue(vb_dev_info);
> > > +		if (!page)
> > > +			break;
> > > +		set_page_bitmap(vb, page);
> > > +		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > +	}
> > > +
> > > +	num_freed_pages = vb->num_pfns;
> > > +
> > > +	if (vb->num_pfns != 0)
> > > +		tell_host(vb, vb->deflate_vq);
> > > +	release_pages_balloon_bitmap(vb);
> > > +	mutex_unlock(&vb->balloon_lock);
> > > +
> > > +	return num_freed_pages;
> > > +}
> > > +
> > > +static void release_pages_balloon_pfns(struct virtio_balloon *vb)
> > >  {
> > >  	unsigned int i;
> > >
> > > @@ -188,7 +355,7 @@ static void release_pages_balloon(struct
> > virtio_balloon *vb)
> > >  	}
> > >  }
> > >
> > > -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> > > +static unsigned leak_balloon_pfns(struct virtio_balloon *vb, size_t
> > > +num)
> > >  {
> > >  	unsigned num_freed_pages;
> > >  	struct page *page;
> > > @@ -215,11 +382,23 @@ static unsigned leak_balloon(struct
> > > virtio_balloon
> > *vb, size_t num)
> > >  	 */
> > >  	if (vb->num_pfns != 0)
> > >  		tell_host(vb, vb->deflate_vq);
> > > -	release_pages_balloon(vb);
> > > +	release_pages_balloon_pfns(vb);
> > >  	mutex_unlock(&vb->balloon_lock);
> > >  	return num_freed_pages;
> > >  }
> > >
> > > +static long leak_balloon(struct virtio_balloon *vb, size_t num) {
> > > +	long num_freed_pages;
> > > +
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP))
> > > +		num_freed_pages = leak_balloon_bitmap(vb, num);
> > > +	else
> > > +		num_freed_pages = leak_balloon_pfns(vb, num);
> > > +
> > > +	return num_freed_pages;
> > > +}
> > > +
> > >  static inline void update_stat(struct virtio_balloon *vb, int idx,
> > >  			       u16 tag, u64 val)
> > >  {
> > > @@ -510,6 +689,8 @@ static int virtballoon_probe(struct
> > > virtio_device
> > *vdev)
> > >  	spin_lock_init(&vb->stop_update_lock);
> > >  	vb->stop_update = false;
> > >  	vb->num_pages = 0;
> > > +	vb->page_bitmap = NULL;
> > > +	vb->bmap_len = 0;
> > >  	mutex_init(&vb->balloon_lock);
> > >  	init_waitqueue_head(&vb->acked);
> > >  	vb->vdev = vdev;
> > > @@ -567,6 +748,7 @@ static void virtballoon_remove(struct
> > > virtio_device
> > *vdev)
> > >  	cancel_work_sync(&vb->update_balloon_stats_work);
> > >
> > >  	remove_common(vb);
> > > +	kfree(vb->page_bitmap);
> > >  	kfree(vb);
> > >  }
> > >
> > > @@ -605,6 +787,7 @@ static unsigned int features[] = {
> > >  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
> > >  	VIRTIO_BALLOON_F_STATS_VQ,
> > >  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > +	VIRTIO_BALLOON_F_PAGE_BITMAP,
> > >  };
> > >
> > >  static struct virtio_driver virtio_balloon_driver = { diff --git
> > > a/include/uapi/linux/virtio_balloon.h
> > > b/include/uapi/linux/virtio_balloon.h
> > > index 343d7dd..f78fa47 100644
> > > --- a/include/uapi/linux/virtio_balloon.h
> > > +++ b/include/uapi/linux/virtio_balloon.h
> > > @@ -34,6 +34,7 @@
> > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before
> > reclaiming pages */
> > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue
> > */
> > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon
> > on OOM */
> > > +#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info
> > with bitmap */
> > >
> > >  /* Size of a PFN in the balloon interface. */  #define
> > > VIRTIO_BALLOON_PFN_SHIFT 12 diff --git a/mm/page_alloc.c
> > > b/mm/page_alloc.c index c1069ef..74b2fc5 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2139,6 +2139,12 @@ void drain_all_pages(struct zone *zone)
> > >  								zone, 1);
> > >  }
> > >
> > > +unsigned long get_max_pfn(void)
> > > +{
> > > +	return max_pfn;
> > > +}
> > > +EXPORT_SYMBOL(get_max_pfn);
> > > +
> > >  #ifdef CONFIG_HIBERNATION
> > >
> > >  void mark_free_pages(struct zone *zone)
> >
> > Suggestion to address all above comments:
> > 	1. allocate a bunch of pages and link them up,
> > 	   calculating the min and the max pfn.
> > 	   if max-min exceeds the allocated bitmap size,
> > 	   tell host.
> 
> I am not sure if it works well in some cases, e.g. The allocated pages are
> across a wide range and the max-min > limit is very frequently to be true.
> Then, there will be many times of virtio transmission and it's bad for
> performance improvement. Right?
> 
> 
> > 	2. limit allocated bitmap size to something reasonable.
> > 	   How about 32Kbytes? This is 256kilo bit in the map, which comes
> > 	   out to 1Giga bytes of memory in the balloon.
> 
> So, even the VM has 1TB of RAM, the page bitmap will take 32MB of memory.
> Maybe it's better to use a big page bitmap the save the pages allocated by
> balloon,
> and split the big page bitmap to 32K bytes unit, then transfer one unit at a
> time.
> 
> 
> Should we use a page bitmap to replace 'vb->pages' ?
> 
> How about rolling back to use PFNs if the count of requested pages is a small
> number?
> 
> Liang
> > > --
> > > 1.9.1
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in the body
> of
> > a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Li, Liang Z @ 2016-05-24  9:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <20160520120038.GA28757@redhat.com>

> On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> > The implementation of the current virtio-balloon is not very
> > efficient, Bellow is test result of time spends on inflating the
> > balloon to 3GB of a 4GB idle guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > 96ms) d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete. The
> > test shows that the bottle neck is the stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and do the madvise with a bulk
> > of pages, instead of the current page per page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the kernel side implementation which is intended to
> > speed up the inflating & deflating process by adding a new feature to
> > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> >
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  drivers/virtio/virtio_balloon.c     | 199
> ++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/virtio_balloon.h |   1 +
> >  mm/page_alloc.c                     |   6 ++
> >  3 files changed, 198 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c
> > b/drivers/virtio/virtio_balloon.c index 7b6d74f..5330b6f 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -45,6 +45,8 @@ static int oom_pages =
> OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> >  struct virtio_balloon {
> >  	struct virtio_device *vdev;
> >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -62,6 +64,9
> > @@ struct virtio_balloon {
> >
> >  	/* Number of balloon pages we've told the Host we're not using. */
> >  	unsigned int num_pages;
> > +	unsigned long *page_bitmap;
> > +	unsigned long start_pfn, end_pfn;
> > +	unsigned long bmap_len;
> >  	/*
> >  	 * The pages we've told the Host we're not using are enqueued
> >  	 * at vb_dev_info->pages list.
> > @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
> >  	wake_up(&vb->acked);
> >  }
> >
> > +static int balloon_page_bitmap_init(struct virtio_balloon *vb) {
> > +	unsigned long max_pfn, bmap_bytes;
> > +
> > +	max_pfn = get_max_pfn();
> 
> This is racy. max_pfn could be increased by memory hotplug after you got it.
> 
> 
> > +	bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> > +	if (!vb->page_bitmap)
> > +		vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> 
> Likely to fail for a huge busy guest.
> Why not init on device probe?
> this way
> 	- probe will fail, or we can clear the feature bit
> 	- free memory is more likely to be available
> 

Very good suggestion!

> 
> > +	else {
> > +		if (bmap_bytes <= vb->bmap_len)
> > +			memset(vb->page_bitmap, 0, bmap_bytes);
> > +		else {
> > +			kfree(vb->page_bitmap);
> > +			vb->page_bitmap = kzalloc(bmap_bytes,
> GFP_KERNEL);
> > +		}
> > +	}
> > +	if (!vb->page_bitmap) {
> > +		dev_err(&vb->vdev->dev, "%s failure: allocate page
> bitmap\n",
> > +			 __func__);
> > +		return -ENOMEM;
> > +	}
> > +	vb->bmap_len = bmap_bytes;
> > +	vb->start_pfn = max_pfn;
> > +	vb->end_pfn = 0;
> > +
> > +	return 0;
> > +}
> > +
> 
> >  {
> > -	struct scatterlist sg;
> >  	unsigned int len;
> >
> > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +	if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +		u32 page_shift = PAGE_SHIFT;
> > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > +		struct scatterlist sg[5];
> > +
> > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> sizeof(long);
> > +
> > +		sg_init_table(sg, 5);
> > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> 
> This can be pre-initialized, correct?

pre-initialized? I am not quite understand your mean.

> 
> > +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> > +
> > +	} else {
> > +		struct scatterlist sg;
> > +
> > +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb-
> >num_pfns);
> > +		/* We should always be able to add one buffer to an
> > +		* empty queue.
> > +		*/
> > +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > +	}
> >
> > -	/* We should always be able to add one buffer to an empty queue.
> */
> > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> >  	virtqueue_kick(vq);
> >
> >  	/* When host has read buffer, this completes via balloon_ack */ @@
> > -137,7 +193,21 @@ static void set_page_pfns(u32 pfns[], struct page *page)
> >  		pfns[i] = page_to_balloon_pfn(page) + i;  }
> >
> > -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> > +static void set_page_bitmap(struct virtio_balloon *vb, struct page
> > +*page) {
> > +	unsigned int i;
> > +	unsigned long *bitmap = vb->page_bitmap;
> > +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > +
> > +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> > +		set_bit(balloon_pfn + i, bitmap);
> > +	if (balloon_pfn < vb->start_pfn)
> > +		vb->start_pfn = balloon_pfn;
> > +	if (balloon_pfn > vb->end_pfn)
> > +		vb->end_pfn = balloon_pfn;
> > +}
> > +
> > +static unsigned fill_balloon_pfns(struct virtio_balloon *vb, size_t
> > +num)
> >  {
> >  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> >  	unsigned num_allocated_pages;
> > @@ -174,7 +244,104 @@ static unsigned fill_balloon(struct virtio_balloon
> *vb, size_t num)
> >  	return num_allocated_pages;
> >  }
> >
> > -static void release_pages_balloon(struct virtio_balloon *vb)
> > +static long fill_balloon_bitmap(struct virtio_balloon *vb, size_t
> > +num) {
> > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > +	long num_allocated_pages = 0;
> > +
> > +	if (balloon_page_bitmap_init(vb) < 0)
> > +		return num;
> > +
> > +	mutex_lock(&vb->balloon_lock);
> > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > +		struct page *page = balloon_page_enqueue(vb_dev_info);
> > +
> > +		if (!page) {
> > +			dev_info_ratelimited(&vb->vdev->dev,
> > +					     "Out of puff! Can't get %u
> pages\n",
> > +
> VIRTIO_BALLOON_PAGES_PER_PAGE);
> > +			/* Sleep for at least 1/5 of a second before retry. */
> > +			msleep(200);
> > +			break;
> > +		}
> > +		set_page_bitmap(vb, page);
> > +		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > +		if (!virtio_has_feature(vb->vdev,
> > +
> 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > +			adjust_managed_page_count(page, -1);
> > +	}
> 
> This is grossly inefficient if you only requested a single page.
> And it's also allocating memory very aggressively without ever telling the host
> what is going on.

If only requested a single page, there is no need  to send the entire page bitmap,
This RFC patch has already considered about this. But it can works very well if requesting
several pages  which across a large range.

> > +
> > +	num_allocated_pages = vb->num_pfns;
> > +	/* Did we get any? */
> > +	if (vb->num_pfns != 0)
> > +		tell_host(vb, vb->inflate_vq);
> > +	mutex_unlock(&vb->balloon_lock);
> > +
> > +	return num_allocated_pages;
> > +}
> > +
> > +static long fill_balloon(struct virtio_balloon *vb, size_t num) {
> > +	long num_allocated_pages;
> > +
> > +	if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP))
> > +		num_allocated_pages = fill_balloon_bitmap(vb, num);
> > +	else
> > +		num_allocated_pages = fill_balloon_pfns(vb, num);
> > +
> > +	return num_allocated_pages;
> > +}
> > +
> > +static void release_pages_balloon_bitmap(struct virtio_balloon *vb) {
> > +	unsigned long pfn, offset, size;
> > +	struct page *page;
> > +
> > +	size = min(vb->bmap_len * BITS_PER_BYTE, vb->end_pfn);
> > +	for (offset = vb->start_pfn; offset < size;
> > +		 offset = pfn + VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > +		pfn = find_next_bit(vb->page_bitmap, size, offset);
> > +		if (pfn < size) {
> > +			page = balloon_pfn_to_page(pfn);
> > +			if (!virtio_has_feature(vb->vdev,
> > +
> 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > +				adjust_managed_page_count(page, 1);
> > +			put_page(page);
> > +		}
> > +	}
> > +}
> > +
> > +static unsigned long leak_balloon_bitmap(struct virtio_balloon *vb,
> > +size_t num) {
> > +	unsigned long num_freed_pages = num;
> > +	struct page *page;
> > +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > +
> > +	if (balloon_page_bitmap_init(vb) < 0)
> > +		return num_freed_pages;
> > +
> > +	mutex_lock(&vb->balloon_lock);
> > +	for (vb->num_pfns = 0; vb->num_pfns < num;
> > +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > +		page = balloon_page_dequeue(vb_dev_info);
> > +		if (!page)
> > +			break;
> > +		set_page_bitmap(vb, page);
> > +		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> > +	}
> > +
> > +	num_freed_pages = vb->num_pfns;
> > +
> > +	if (vb->num_pfns != 0)
> > +		tell_host(vb, vb->deflate_vq);
> > +	release_pages_balloon_bitmap(vb);
> > +	mutex_unlock(&vb->balloon_lock);
> > +
> > +	return num_freed_pages;
> > +}
> > +
> > +static void release_pages_balloon_pfns(struct virtio_balloon *vb)
> >  {
> >  	unsigned int i;
> >
> > @@ -188,7 +355,7 @@ static void release_pages_balloon(struct
> virtio_balloon *vb)
> >  	}
> >  }
> >
> > -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> > +static unsigned leak_balloon_pfns(struct virtio_balloon *vb, size_t
> > +num)
> >  {
> >  	unsigned num_freed_pages;
> >  	struct page *page;
> > @@ -215,11 +382,23 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
> >  	 */
> >  	if (vb->num_pfns != 0)
> >  		tell_host(vb, vb->deflate_vq);
> > -	release_pages_balloon(vb);
> > +	release_pages_balloon_pfns(vb);
> >  	mutex_unlock(&vb->balloon_lock);
> >  	return num_freed_pages;
> >  }
> >
> > +static long leak_balloon(struct virtio_balloon *vb, size_t num) {
> > +	long num_freed_pages;
> > +
> > +	if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP))
> > +		num_freed_pages = leak_balloon_bitmap(vb, num);
> > +	else
> > +		num_freed_pages = leak_balloon_pfns(vb, num);
> > +
> > +	return num_freed_pages;
> > +}
> > +
> >  static inline void update_stat(struct virtio_balloon *vb, int idx,
> >  			       u16 tag, u64 val)
> >  {
> > @@ -510,6 +689,8 @@ static int virtballoon_probe(struct virtio_device
> *vdev)
> >  	spin_lock_init(&vb->stop_update_lock);
> >  	vb->stop_update = false;
> >  	vb->num_pages = 0;
> > +	vb->page_bitmap = NULL;
> > +	vb->bmap_len = 0;
> >  	mutex_init(&vb->balloon_lock);
> >  	init_waitqueue_head(&vb->acked);
> >  	vb->vdev = vdev;
> > @@ -567,6 +748,7 @@ static void virtballoon_remove(struct virtio_device
> *vdev)
> >  	cancel_work_sync(&vb->update_balloon_stats_work);
> >
> >  	remove_common(vb);
> > +	kfree(vb->page_bitmap);
> >  	kfree(vb);
> >  }
> >
> > @@ -605,6 +787,7 @@ static unsigned int features[] = {
> >  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
> >  	VIRTIO_BALLOON_F_STATS_VQ,
> >  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > +	VIRTIO_BALLOON_F_PAGE_BITMAP,
> >  };
> >
> >  static struct virtio_driver virtio_balloon_driver = { diff --git
> > a/include/uapi/linux/virtio_balloon.h
> > b/include/uapi/linux/virtio_balloon.h
> > index 343d7dd..f78fa47 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -34,6 +34,7 @@
> >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before
> reclaiming pages */
> >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue
> */
> >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon
> on OOM */
> > +#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info
> with bitmap */
> >
> >  /* Size of a PFN in the balloon interface. */  #define
> > VIRTIO_BALLOON_PFN_SHIFT 12 diff --git a/mm/page_alloc.c
> > b/mm/page_alloc.c index c1069ef..74b2fc5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2139,6 +2139,12 @@ void drain_all_pages(struct zone *zone)
> >  								zone, 1);
> >  }
> >
> > +unsigned long get_max_pfn(void)
> > +{
> > +	return max_pfn;
> > +}
> > +EXPORT_SYMBOL(get_max_pfn);
> > +
> >  #ifdef CONFIG_HIBERNATION
> >
> >  void mark_free_pages(struct zone *zone)
> 
> Suggestion to address all above comments:
> 	1. allocate a bunch of pages and link them up,
> 	   calculating the min and the max pfn.
> 	   if max-min exceeds the allocated bitmap size,
> 	   tell host.

I am not sure if it works well in some cases, e.g. The allocated pages 
are across a wide range and the max-min > limit is very frequently to be true.
Then, there will be many times of virtio transmission and it's bad for performance
improvement. Right?


> 	2. limit allocated bitmap size to something reasonable.
> 	   How about 32Kbytes? This is 256kilo bit in the map, which comes
> 	   out to 1Giga bytes of memory in the balloon.

So, even the VM has 1TB of RAM, the page bitmap will take 32MB of memory.
Maybe it's better to use a big page bitmap the save the pages allocated by balloon,
and split the big page bitmap to 32K bytes unit, then transfer one unit at a time.


Should we use a page bitmap to replace 'vb->pages' ?

How about rolling back to use PFNs if the count of requested pages is a small number?

Liang
> > --
> > 1.9.1
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [RFC PATCH V3 3/3] vhost: device IOTLB API
From: Jason Wang @ 2016-05-24  9:36 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1464082585-13049-1-git-send-email-jasowang@redhat.com>

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace(qemu) implementation of DMA
remapping.

The idea is simple, cache the translation in a software device IOTLB
(interval tree) implementation in vhost and make vhost_net file
descriptor could be read or wrote by a process. When vhost meets an
IOTLB miss, the fault address, size and access could be read from the
file. After userspace finishes the translation, it write the
translated address to the vhost_net file to update the device IOTLB.

When device IOTLB (VHOST_F_DEVICE_IOTLB) is enabled all vq address
set by ioctl were treated as iova instead of virtual address and the
accessing could only be done through IOTLB instead of direct
userspace memory access. Before each rounds or vq processing, all vq
metadata were prefetched in device IOTLB to make sure no translation
fault happens during vq processing.

In most cases, virtqueue were mapped contiguous even in virtual
address. So the IOTLB translation for virtqueue itself maybe a little
bit slower. We can add fast path on top of this patch.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c        |  54 ++++-
 drivers/vhost/vhost.c      | 586 +++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.h      |  35 ++-
 include/uapi/linux/vhost.h |  28 +++
 4 files changed, 656 insertions(+), 47 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a584239..0f45e3d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 enum {
 	VHOST_NET_FEATURES = VHOST_FEATURES |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF)
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VHOST_F_DEVICE_IOTLB)
 };
 
 enum {
@@ -308,7 +309,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 {
 	unsigned long uninitialized_var(endtime);
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-				    out_num, in_num, NULL, NULL);
+				  out_num, in_num, NULL, NULL,
+				  VHOST_ACCESS_RO);
 
 	if (r == vq->num && vq->busyloop_timeout) {
 		preempt_disable();
@@ -318,7 +320,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 			cpu_relax_lowlatency();
 		preempt_enable();
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-					out_num, in_num, NULL, NULL);
+				      out_num, in_num, NULL, NULL,
+				      VHOST_ACCESS_RO);
 	}
 
 	return r;
@@ -351,6 +354,9 @@ static void handle_tx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
+	if (!vq_iotlb_prefetch(vq))
+		goto out;
+
 	vhost_disable_notify(&net->dev, vq);
 
 	hdr_size = nvq->vhost_hlen;
@@ -538,7 +544,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		}
 		r = vhost_get_vq_desc(vq, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
-				      &in, log, log_num);
+				      &in, log, log_num, VHOST_ACCESS_WO);
 		if (unlikely(r < 0))
 			goto err;
 
@@ -612,6 +618,10 @@ static void handle_rx(struct vhost_net *net)
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
+
+	if (!vq_iotlb_prefetch(vq))
+		goto out;
+
 	vhost_disable_notify(&net->dev, vq);
 
 	vhost_hlen = nvq->vhost_hlen;
@@ -1085,6 +1095,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 		mutex_unlock(&n->dev.mutex);
 		return -EFAULT;
 	}
+	if ((features & (1ULL << VHOST_F_DEVICE_IOTLB))) {
+		if (vhost_init_device_iotlb(&n->dev, true))
+			return -EFAULT;
+	}
+
 	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
 		mutex_lock(&n->vqs[i].vq.mutex);
 		n->vqs[i].vq.acked_features = features;
@@ -1167,9 +1182,40 @@ static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
 }
 #endif
 
+static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+	int noblock = file->f_flags & O_NONBLOCK;
+
+	return vhost_chr_read_iter(dev, to, noblock);
+}
+
+static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+
+	return vhost_chr_write_iter(dev, from);
+}
+
+static unsigned int vhost_net_chr_poll(struct file *file, poll_table *wait)
+{
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+
+	return vhost_chr_poll(file, dev, wait);
+}
+
 static const struct file_operations vhost_net_fops = {
 	.owner          = THIS_MODULE,
 	.release        = vhost_net_release,
+	.read_iter      = vhost_net_chr_read_iter,
+	.write_iter     = vhost_net_chr_write_iter,
+	.poll           = vhost_net_chr_poll,
 	.unlocked_ioctl = vhost_net_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = vhost_net_compat_ioctl,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 166e779..7dd90f8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -35,6 +35,10 @@ static ushort max_mem_regions = 64;
 module_param(max_mem_regions, ushort, 0444);
 MODULE_PARM_DESC(max_mem_regions,
 	"Maximum number of memory regions in memory map. (default: 64)");
+static int max_iotlb_entries = 2048;
+module_param(max_iotlb_entries, int, 0444);
+MODULE_PARM_DESC(max_iotlb_entries,
+	"Maximum number of iotlb entries. (default: 2048)");
 
 enum {
 	VHOST_MEMORY_F_LOG = 0x1,
@@ -309,6 +313,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vhost_disable_cross_endian(vq);
 	vq->busyloop_timeout = 0;
 	vq->umem = NULL;
+	vq->iotlb = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -413,9 +418,14 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
 	dev->umem = NULL;
+	dev->iotlb = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	init_waitqueue_head(&dev->wait);
+	INIT_LIST_HEAD(&dev->read_list);
+	INIT_LIST_HEAD(&dev->pending_list);
+	spin_lock_init(&dev->iotlb_lock);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -563,6 +573,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_free(struct vhost_umem *umem,
+			    struct vhost_umem_node *node)
+{
+	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+	list_del(&node->link);
+	kfree(node);
+	umem->numem--;
+}
+
 static void vhost_umem_clean(struct vhost_umem *umem)
 {
 	struct vhost_umem_node *node, *tmp;
@@ -570,14 +589,31 @@ static void vhost_umem_clean(struct vhost_umem *umem)
 	if (!umem)
 		return;
 
-	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
-		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
-		list_del(&node->link);
-		kvfree(node);
-	}
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
+		vhost_umem_free(umem, node);
+
 	kvfree(umem);
 }
 
+static void vhost_clear_msg(struct vhost_dev *dev)
+{
+	struct vhost_msg_node *node, *n;
+
+	spin_lock(&dev->iotlb_lock);
+
+	list_for_each_entry_safe(node, n, &dev->read_list, node) {
+		list_del(&node->node);
+		kfree(node);
+	}
+
+	list_for_each_entry_safe(node, n, &dev->pending_list, node) {
+		list_del(&node->node);
+		kfree(node);
+	}
+
+	spin_unlock(&dev->iotlb_lock);
+}
+
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -606,6 +642,10 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	/* No one will access memory at this point */
 	vhost_umem_clean(dev->umem);
 	dev->umem = NULL;
+	vhost_umem_clean(dev->iotlb);
+	dev->iotlb = NULL;
+	vhost_clear_msg(dev);
+	wake_up_interruptible_poll(&dev->wait, POLLIN | POLLRDNORM);
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -681,28 +721,343 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 	return 1;
 }
 
-#define vhost_put_user(vq, x, ptr)  __put_user(x, ptr)
+static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
+			  struct iovec iov[], int iov_size, int access);
 
 static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
 			      const void *from, unsigned size)
 {
-	return __copy_to_user(to, from, size);
+	int ret;
+
+	if (!vq->iotlb)
+		return __copy_to_user(to, from, size);
+	else {
+		/* This function should be called after iotlb
+		 * prefetch, which means we're sure that all vq
+		 * could be access through iotlb. So -EAGAIN should
+		 * not happen in this case.
+		 */
+		/* TODO: more fast path */
+		struct iov_iter t;
+		ret = translate_desc(vq, (u64)to, size, vq->iotlb_iov,
+				     ARRAY_SIZE(vq->iotlb_iov),
+				     VHOST_ACCESS_WO);
+		if (ret < 0)
+			goto out;
+		iov_iter_init(&t, WRITE, vq->iotlb_iov, ret, size);
+		ret = copy_to_iter(from, size, &t);
+		if (ret == size)
+			ret = 0;
+	}
+out:
+	return ret;
 }
 
-#define vhost_get_user(vq, x, ptr) __get_user(x, ptr)
+#define vhost_put_user(vq, x, ptr) \
+({ \
+	int ret = -EFAULT; \
+	if (!vq->iotlb) { \
+		ret = __put_user(x, ptr); \
+	} else { \
+		__typeof__(*(ptr)) __x = (x);	\
+		ret = vhost_copy_to_user(vq, ptr, &__x,		\
+					 sizeof(*ptr));		\
+	} \
+	ret; \
+})
 
 static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
 				void *from, unsigned size)
 {
-	return __copy_from_user(to, from, size);
+	int ret;
+
+	if (!vq->iotlb)
+		return __copy_from_user(to, from, size);
+	else {
+		/* This function should be called after iotlb
+		 * prefetch, which means we're sure that vq
+		 * could be access through iotlb. So -EAGAIN should
+		 * not happen in this case.
+		 */
+		/* TODO: more fast path */
+		struct iov_iter f;
+		ret = translate_desc(vq, (u64)from, size, vq->iotlb_iov,
+				     ARRAY_SIZE(vq->iotlb_iov),
+				     VHOST_ACCESS_WO);
+		if (ret < 0) {
+			vq_err(vq, "IOTLB translation failure: uaddr "
+			       "0x%llx size 0x%llx\n",
+			       (unsigned long long) from,
+			       (unsigned long long) size);
+			goto out;
+		}
+		iov_iter_init(&f, READ, vq->iotlb_iov, ret, size);
+		ret = copy_from_iter(to, size, &f);
+		if (ret == size)
+			ret = 0;
+	}
+
+out:
+	return ret;
+}
+
+#define vhost_get_user(vq, x, ptr) \
+({ \
+	int ret; \
+	if (!vq->iotlb) { \
+		ret = __get_user(x, ptr); \
+	} else { \
+		ret = vhost_copy_from_user(vq, &x, ptr, sizeof(*ptr));	\
+	} \
+	ret; \
+})
+
+static void vhost_dev_lock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_lock(&d->vqs[i]->mutex);
+}
+
+static void vhost_dev_unlock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_unlock(&d->vqs[i]->mutex);
+}
+
+static int vhost_new_umem_range(struct vhost_umem *umem,
+				u64 start, u64 size, u64 end,
+				u64 userspace_addr, int perm)
+{
+	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
+
+	if (!node)
+		return -ENOMEM;
+
+	if (umem->numem == max_iotlb_entries) {
+		tmp = list_last_entry(&umem->umem_list, typeof(*tmp), link);
+		vhost_umem_free(umem, tmp);
+	}
+
+	node->start = start;
+	node->size = size;
+	node->last = end;
+	node->userspace_addr = userspace_addr;
+	node->perm = perm;
+	INIT_LIST_HEAD(&node->link);
+	list_add_tail(&node->link, &umem->umem_list);
+	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
+	umem->numem++;
+
+	return 0;
+}
+
+static void vhost_del_umem_range(struct vhost_umem *umem,
+				 u64 start, u64 end)
+{
+	struct vhost_umem_node *node;
+
+	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							   start, end)))
+		vhost_umem_free(umem, node);
+}
+
+static void vhost_iotlb_notify_vq(struct vhost_dev *d,
+				  struct vhost_iotlb_msg *msg)
+{
+	struct vhost_msg_node *node, *n;
+
+	spin_lock(&d->iotlb_lock);
+
+	list_for_each_entry_safe(node, n, &d->pending_list, node) {
+		struct vhost_iotlb_msg *vq_msg = &node->msg.iotlb;
+		if (msg->iova <= vq_msg->iova &&
+		    msg->iova + msg->size - 1 > vq_msg->iova &&
+		    vq_msg->type == VHOST_IOTLB_MISS) {
+			vhost_poll_queue(&node->vq->poll);
+			list_del(&node->node);
+			kfree(node);
+		}
+	}
+
+	spin_unlock(&d->iotlb_lock);
+}
+
+static int umem_access_ok(u64 uaddr, u64 size, int access)
+{
+	if ((access & VHOST_ACCESS_RO) &&
+	    !access_ok(VERIFY_READ, uaddr, size))
+		return -EFAULT;
+	if ((access & VHOST_ACCESS_WO) &&
+	    !access_ok(VERIFY_WRITE, uaddr, size))
+		return -EFAULT;
+	return 0;
+}
+
+int vhost_process_iotlb_msg(struct vhost_dev *dev,
+			    struct vhost_iotlb_msg *msg)
+{
+	int ret = 0;
+
+	vhost_dev_lock_vqs(dev);
+	switch(msg->type) {
+	case VHOST_IOTLB_UPDATE:
+		if (!dev->iotlb) {
+			ret = -EFAULT;
+			goto done;
+		}
+		if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
+			ret = -EFAULT;
+			goto done;
+		}
+		if (vhost_new_umem_range(dev->iotlb, msg->iova, msg->size,
+					 msg->iova + msg->size - 1,
+                                         msg->uaddr, msg->perm)) {
+			ret = -ENOMEM;
+			break;
+		}
+		vhost_iotlb_notify_vq(dev, msg);
+		break;
+	case VHOST_IOTLB_INVALIDATE:
+		vhost_del_umem_range(dev->iotlb, msg->iova,
+				     msg->iova + msg->size - 1);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	vhost_dev_unlock_vqs(dev);
+
+done:
+	return ret;
+}
+ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
+			     struct iov_iter *from)
+{
+	struct vhost_msg_node node;
+	unsigned size = sizeof(struct vhost_msg);
+	size_t ret;
+	int err;
+
+	if (iov_iter_count(from) < size)
+		return 0;
+	ret = copy_from_iter(&node.msg, size, from);
+	if (ret != size)
+		goto done;
+
+	switch (node.msg.type) {
+	case VHOST_IOTLB_MSG:
+		err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
+		if (err)
+			ret = err;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+done:
+	return ret;
+}
+EXPORT_SYMBOL(vhost_chr_write_iter);
+
+unsigned int vhost_chr_poll(struct file *file, struct vhost_dev *dev,
+			    poll_table *wait)
+{
+	unsigned int mask = 0;
+
+	poll_wait(file, &dev->wait, wait);
+
+	if (!list_empty(&dev->read_list))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+EXPORT_SYMBOL(vhost_chr_poll);
+
+ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
+			    int noblock)
+{
+	DEFINE_WAIT(wait);
+	struct vhost_msg_node *node;
+	ssize_t ret = 0;
+	unsigned size = sizeof(struct vhost_msg);
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	while (1) {
+		if (!noblock)
+			prepare_to_wait(&dev->wait, &wait,
+					TASK_INTERRUPTIBLE);
+
+		node = vhost_dequeue_msg(dev, &dev->read_list);
+		if (node)
+			break;
+		if (noblock) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (!dev->iotlb) {
+			ret = -EBADFD;
+			break;
+		}
+
+		schedule();
+	}
+
+	if (!noblock)
+		finish_wait(&dev->wait, &wait);
+
+	if (node) {
+		ret = copy_to_iter(&node->msg, size, to);
+
+		if (ret != size || node->msg.type != VHOST_IOTLB_MISS) {
+			kfree(node);
+			return ret;
+		}
+
+		vhost_enqueue_msg(dev, &dev->pending_list, node);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vhost_chr_read_iter);
+
+static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
+{
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_msg_node *node;
+	struct vhost_iotlb_msg *msg;
+
+	node = vhost_new_msg(vq, VHOST_IOTLB_MISS);
+	if (!node)
+		return -ENOMEM;
+
+	msg = &node->msg.iotlb;
+	msg->type = VHOST_IOTLB_MISS;
+	msg->iova = iova;
+	msg->perm = access;
+	msg->size = ktime_get_ns();
+
+	vhost_enqueue_msg(dev, &dev->read_list, node);
+
+	return 0;
 }
 
 static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
+
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring + s) &&
@@ -710,6 +1065,54 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			sizeof *used + num * sizeof *used->ring + s);
 }
 
+static int iotlb_access_ok(struct vhost_virtqueue *vq,
+			   int access, u64 addr, u64 len)
+{
+	const struct vhost_umem_node *node;
+	struct vhost_umem *umem = vq->iotlb;
+	u64 s = 0, size;
+
+	while (len > s) {
+		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							   addr,
+							   addr + len - 1);
+		if (node == NULL || node->start > addr) {
+			vhost_iotlb_miss(vq, addr, access);
+			return false;
+		} else if (!(node->perm & access)) {
+			/* Report the possible access violation by
+			 * request another translation from userspace.
+			 */
+			return false;
+		}
+
+		size = node->size - addr + node->start;
+		s += size;
+		addr += size;
+	}
+
+	return true;
+}
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+	unsigned int num = vq->num;
+
+	if (!vq->iotlb)
+		return 1;
+
+	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->desc,
+			       num * sizeof *vq->desc) &&
+	       iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->avail,
+			       sizeof *vq->avail +
+			       num * sizeof *vq->avail->ring + s) &&
+	       iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)vq->used,
+			       sizeof *vq->used +
+			       num * sizeof *vq->used->ring + s);
+}
+EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
+
 /* Can we log writes? */
 /* Caller should have device mutex but not vq mutex */
 int vhost_log_access_ok(struct vhost_dev *dev)
@@ -736,16 +1139,35 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
+	if (vq->iotlb) {
+		/* When device IOTLB was used, the access validation
+		 * will be validated during prefetching.
+		 */
+		return 1;
+	}
 	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
 		vq_log_access_ok(vq, vq->log_base);
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static struct vhost_umem *vhost_umem_alloc(void)
+{
+	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
+
+	if (!umem)
+		return NULL;
+
+	umem->umem_tree = RB_ROOT;
+	umem->numem = 0;
+	INIT_LIST_HEAD(&umem->umem_list);
+
+	return umem;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem;
 	struct vhost_memory_region *region;
-	struct vhost_umem_node *node;
 	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
@@ -767,28 +1189,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EFAULT;
 	}
 
-	newumem = vhost_kvzalloc(sizeof(*newumem));
+	newumem = vhost_umem_alloc();
 	if (!newumem) {
 		kvfree(newmem);
 		return -ENOMEM;
 	}
 
-	newumem->umem_tree = RB_ROOT;
-	INIT_LIST_HEAD(&newumem->umem_list);
-
 	for (region = newmem->regions;
 	     region < newmem->regions + mem.nregions;
 	     region++) {
-		node = vhost_kvzalloc(sizeof(*node));
-		if (!node)
+		if (vhost_new_umem_range(newumem,
+					 region->guest_phys_addr,
+					 region->memory_size,
+					 region->guest_phys_addr +
+					 region->memory_size - 1,
+					 region->userspace_addr,
+				         VHOST_ACCESS_RW))
 			goto err;
-		node->start = region->guest_phys_addr;
-		node->size = region->memory_size;
-		node->last = node->start + node->size - 1;
-		node->userspace_addr = region->userspace_addr;
-		INIT_LIST_HEAD(&node->link);
-		list_add_tail(&node->link, &newumem->umem_list);
-		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
 	}
 
 	if (!memory_access_ok(d, newumem, 0))
@@ -1032,6 +1449,30 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
 }
 EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
 
+int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
+{
+	struct vhost_umem *niotlb, *oiotlb;
+	int i;
+
+	niotlb = vhost_umem_alloc();
+	if (!niotlb)
+		return -ENOMEM;
+
+	oiotlb = d->iotlb;
+	d->iotlb = niotlb;
+
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
+		d->vqs[i]->iotlb = niotlb;
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
+
+	vhost_umem_clean(oiotlb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_init_device_iotlb);
+
 /* Caller must have device mutex */
 long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
@@ -1246,15 +1687,20 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 	if (r)
 		goto err;
 	vq->signalled_used_valid = false;
-	if (!access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
+	if (!vq->iotlb &&
+	    !access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
 		r = -EFAULT;
 		goto err;
 	}
 	r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
-	if (r)
+	if (r) {
+		vq_err(vq, "Can't access used idx at 0x%llx\n",
+			(unsigned long long) &vq->used->idx);
 		goto err;
+	}
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
 	return 0;
+
 err:
 	vq->is_le = is_le;
 	return r;
@@ -1262,10 +1708,11 @@ err:
 EXPORT_SYMBOL_GPL(vhost_vq_init_access);
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
-			  struct iovec iov[], int iov_size)
+			  struct iovec iov[], int iov_size, int access)
 {
 	const struct vhost_umem_node *node;
-	struct vhost_umem *umem = vq->umem;
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
@@ -1276,12 +1723,21 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			ret = -ENOBUFS;
 			break;
 		}
+
 		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
 							addr, addr + len - 1);
 		if (node == NULL || node->start > addr) {
-			ret = -EFAULT;
+			if (umem != dev->iotlb) {
+				ret = -EFAULT;
+				break;
+			}
+			ret = -EAGAIN;
+			break;
+		} else if (!(node->perm & access)) {
+			ret = -EPERM;
 			break;
 		}
+
 		_iov = iov + ret;
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
@@ -1292,6 +1748,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		++ret;
 	}
 
+	if (ret == -EAGAIN)
+		vhost_iotlb_miss(vq, addr, access);
 	return ret;
 }
 
@@ -1320,7 +1778,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			struct iovec iov[], unsigned int iov_size,
 			unsigned int *out_num, unsigned int *in_num,
 			struct vhost_log *log, unsigned int *log_num,
-			struct vring_desc *indirect)
+			struct vring_desc *indirect, int access)
 {
 	struct vring_desc desc;
 	unsigned int i = 0, count, found = 0;
@@ -1338,9 +1796,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
 	}
 
 	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
-			     UIO_MAXIOV);
+			     UIO_MAXIOV, VHOST_ACCESS_RO);
 	if (unlikely(ret < 0)) {
-		vq_err(vq, "Translation failure %d in indirect.\n", ret);
+		if (ret != -EAGAIN)
+			vq_err(vq, "Translation failure %d in indirect.\n", ret);
 		return ret;
 	}
 	iov_iter_init(&from, READ, vq->indirect, ret, len);
@@ -1380,10 +1839,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d indirect idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d indirect idx %d\n",
+					ret, i);
 			return ret;
 		}
 		/* If this is an input descriptor, increment that count. */
@@ -1419,7 +1879,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
 int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access)
 {
 	struct vring_desc desc;
 	unsigned int i, head, found = 0;
@@ -1498,10 +1959,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
-					   log, log_num, &desc);
+					   log, log_num, &desc, access);
 			if (unlikely(ret < 0)) {
-				vq_err(vq, "Failure detected "
-				       "in indirect descriptor at idx %d\n", i);
+				if (ret != -EAGAIN)
+					vq_err(vq, "Failure detected "
+						"in indirect descriptor at idx %d\n", i);
 				return ret;
 			}
 			continue;
@@ -1509,10 +1971,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d descriptor idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d descriptor idx %d\n",
+					ret, i);
 			return ret;
 		}
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
@@ -1781,6 +2244,47 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
+/* Create a new message. */
+struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
+{
+	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
+	if (!node)
+		return NULL;
+	node->vq = vq;
+	node->msg.type = type;
+	return node;
+}
+EXPORT_SYMBOL_GPL(vhost_new_msg);
+
+void vhost_enqueue_msg(struct vhost_dev *dev, struct list_head *head,
+		       struct vhost_msg_node *node)
+{
+	spin_lock(&dev->iotlb_lock);
+	list_add_tail(&node->node, head);
+	spin_unlock(&dev->iotlb_lock);
+
+	wake_up_interruptible_poll(&dev->wait, POLLIN | POLLRDNORM);
+}
+EXPORT_SYMBOL_GPL(vhost_enqueue_msg);
+
+struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
+					 struct list_head *head)
+{
+	struct vhost_msg_node *node = NULL;
+
+	spin_lock(&dev->iotlb_lock);
+	if (!list_empty(head)) {
+		node = list_first_entry(head, struct vhost_msg_node,
+					node);
+		list_del(&node->node);
+	}
+	spin_unlock(&dev->iotlb_lock);
+
+	return node;
+}
+EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
+
+
 static int __init vhost_init(void)
 {
 	return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b93b6a0..ab82eaa 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -63,13 +63,15 @@ struct vhost_umem_node {
 	__u64 last;
 	__u64 size;
 	__u64 userspace_addr;
-	__u64 flags_padding;
+	__u32 perm;
+	__u32 flags_padding;
 	__u64 __subtree_last;
 };
 
 struct vhost_umem {
 	struct rb_root umem_tree;
 	struct list_head umem_list;
+	int numem;
 };
 
 /* The virtqueue structure describes a queue attached to a device. */
@@ -117,10 +119,12 @@ struct vhost_virtqueue {
 	u64 log_addr;
 
 	struct iovec iov[UIO_MAXIOV];
+	struct iovec iotlb_iov[64];
 	struct iovec *indirect;
 	struct vring_used_elem *heads;
 	/* Protected by virtqueue mutex. */
 	struct vhost_umem *umem;
+	struct vhost_umem *iotlb;
 	void *private_data;
 	u64 acked_features;
 	/* Log write descriptors */
@@ -137,6 +141,12 @@ struct vhost_virtqueue {
 	u32 busyloop_timeout;
 };
 
+struct vhost_msg_node {
+  struct vhost_msg msg;
+  struct vhost_virtqueue *vq;
+  struct list_head node;
+};
+
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
@@ -148,6 +158,11 @@ struct vhost_dev {
 	struct list_head work_list;
 	struct task_struct *worker;
 	struct vhost_umem *umem;
+	struct vhost_umem *iotlb;
+	spinlock_t iotlb_lock;
+	struct list_head read_list;
+	struct list_head pending_list;
+	wait_queue_head_t wait;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
@@ -166,7 +181,8 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num);
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
@@ -184,6 +200,21 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
+
+struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
+void vhost_enqueue_msg(struct vhost_dev *dev,
+                       struct list_head *head,
+                       struct vhost_msg_node *node);
+struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
+					 struct list_head *head);
+unsigned int vhost_chr_poll(struct file *file, struct vhost_dev *dev,
+                            poll_table *wait);
+ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
+			    int noblock);
+ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
+                             struct iov_iter *from);
+int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 61a8777..f2e5757 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -47,6 +47,32 @@ struct vhost_vring_addr {
 	__u64 log_guest_addr;
 };
 
+/* no alignment requirement */
+struct vhost_iotlb_msg {
+  __u64 iova;
+  __u64 size;
+  __u64 uaddr;
+#define VHOST_ACCESS_RO      0x1
+#define VHOST_ACCESS_WO      0x2
+#define VHOST_ACCESS_RW      0x3
+  __u8 perm;
+#define VHOST_IOTLB_MISS           1
+#define VHOST_IOTLB_UPDATE         2
+#define VHOST_IOTLB_INVALIDATE     3
+#define VHOST_IOTLB_ACCESS_FAIL    4
+  __u8 type;
+};
+
+#define VHOST_IOTLB_MSG 0x1
+
+struct vhost_msg {
+  int type;
+  union {
+    struct vhost_iotlb_msg iotlb;
+    __u8 padding[64];
+  };
+};
+
 struct vhost_memory_region {
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
@@ -146,6 +172,8 @@ struct vhost_memory {
 #define VHOST_F_LOG_ALL 26
 /* vhost-net should add virtio_net_hdr for RX, and strip for TX packets. */
 #define VHOST_NET_F_VIRTIO_NET_HDR 27
+/* Vhost have device IOTLB */
+#define VHOST_F_DEVICE_IOTLB 63
 
 /* VHOST_SCSI specific definitions */
 
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH V3 2/3] vhost: convert pre sorted vhost memory array to interval tree
From: Jason Wang @ 2016-05-24  9:36 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1464082585-13049-1-git-send-email-jasowang@redhat.com>

Current pre-sorted memory region array has some limitations for future
device IOTLB conversion:

1) need extra work for adding and removing a single region, and it's
   expected to be slow because of sorting or memory re-allocation.
2) need extra work of removing a large range which may intersect
   several regions with different size.
3) need trick for a replacement policy like LRU

To overcome the above shortcomings, this patch convert it to interval
tree which can easily address the above issue with almost no extra
work.

The patch could be used for:

- Extend the current API and only let the userspace to send diffs of
  memory table.
- Simplify Device IOTLB implementation.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   |   8 +--
 drivers/vhost/vhost.c | 182 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.h |  27 ++++++--
 3 files changed, 128 insertions(+), 89 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index beaeb17..a584239 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1037,20 +1037,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	struct socket *tx_sock = NULL;
 	struct socket *rx_sock = NULL;
 	long err;
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
-	memory = vhost_dev_reset_owner_prepare();
-	if (!memory) {
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
 		err = -ENOMEM;
 		goto done;
 	}
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
-	vhost_dev_reset_owner(&n->dev, memory);
+	vhost_dev_reset_owner(&n->dev, umem);
 	vhost_net_vq_reset(n);
 done:
 	mutex_unlock(&n->dev.mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9f2a63a..166e779 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,6 +27,7 @@
 #include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
+#include <linux/interval_tree_generic.h>
 
 #include "vhost.h"
 
@@ -42,6 +43,10 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
+INTERVAL_TREE_DEFINE(struct vhost_umem_node,
+		     rb, __u64, __subtree_last,
+		     START, LAST, , vhost_umem_interval_tree);
+
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
 {
@@ -300,10 +305,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
-	vq->memory = NULL;
 	vhost_reset_is_le(vq);
 	vhost_disable_cross_endian(vq);
 	vq->busyloop_timeout = 0;
+	vq->umem = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -407,7 +412,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	mutex_init(&dev->mutex);
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
-	dev->memory = NULL;
+	dev->umem = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
@@ -512,27 +517,36 @@ err_mm:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-struct vhost_memory *vhost_dev_reset_owner_prepare(void)
+static void *vhost_kvzalloc(unsigned long size)
 {
-	return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+	if (!n)
+		n = vzalloc(size);
+	return n;
+}
+
+struct vhost_umem *vhost_dev_reset_owner_prepare(void)
+{
+	return vhost_kvzalloc(sizeof(struct vhost_umem));
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
 {
 	int i;
 
 	vhost_dev_cleanup(dev, true);
 
 	/* Restore memory to default empty mapping. */
-	memory->nregions = 0;
-	dev->memory = memory;
+	INIT_LIST_HEAD(&umem->umem_list);
+	dev->umem = umem;
 	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
 	 * VQs aren't running.
 	 */
 	for (i = 0; i < dev->nvqs; ++i)
-		dev->vqs[i]->memory = memory;
+		dev->vqs[i]->umem = umem;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
 
@@ -549,6 +563,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_clean(struct vhost_umem *umem)
+{
+	struct vhost_umem_node *node, *tmp;
+
+	if (!umem)
+		return;
+
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
+		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+		list_del(&node->link);
+		kvfree(node);
+	}
+	kvfree(umem);
+}
+
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -575,8 +604,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 		fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kvfree(dev->memory);
-	dev->memory = NULL;
+	vhost_umem_clean(dev->umem);
+	dev->umem = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -602,25 +631,25 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
+static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
 			       int log_all)
 {
-	int i;
+	struct vhost_umem_node *node;
 
-	if (!mem)
+	if (!umem)
 		return 0;
 
-	for (i = 0; i < mem->nregions; ++i) {
-		struct vhost_memory_region *m = mem->regions + i;
-		unsigned long a = m->userspace_addr;
-		if (m->memory_size > ULONG_MAX)
+	list_for_each_entry(node, &umem->umem_list, link) {
+		unsigned long a = node->userspace_addr;
+
+		if (node->size > ULONG_MAX)
 			return 0;
 		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
-				    m->memory_size))
+				    node->size))
 			return 0;
 		else if (log_all && !log_access_ok(log_base,
-						   m->guest_phys_addr,
-						   m->memory_size))
+						   node->start,
+						   node->size))
 			return 0;
 	}
 	return 1;
@@ -628,7 +657,7 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
+static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 			    int log_all)
 {
 	int i;
@@ -641,7 +670,8 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 		log = log_all || vhost_has_feature(d->vqs[i], VHOST_F_LOG_ALL);
 		/* If ring is inactive, will check when it's enabled. */
 		if (d->vqs[i]->private_data)
-			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem, log);
+			ok = vq_memory_access_ok(d->vqs[i]->log_base,
+						 umem, log);
 		else
 			ok = 1;
 		mutex_unlock(&d->vqs[i]->mutex);
@@ -684,7 +714,7 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 /* Caller should have device mutex but not vq mutex */
 int vhost_log_access_ok(struct vhost_dev *dev)
 {
-	return memory_access_ok(dev, dev->memory, 1);
+	return memory_access_ok(dev, dev->umem, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_log_access_ok);
 
@@ -695,7 +725,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
-	return vq_memory_access_ok(log_base, vq->memory,
+	return vq_memory_access_ok(log_base, vq->umem,
 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
@@ -711,28 +741,12 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
-static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
-{
-	const struct vhost_memory_region *r1 = p1, *r2 = p2;
-	if (r1->guest_phys_addr < r2->guest_phys_addr)
-		return 1;
-	if (r1->guest_phys_addr > r2->guest_phys_addr)
-		return -1;
-	return 0;
-}
-
-static void *vhost_kvzalloc(unsigned long size)
-{
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-
-	if (!n)
-		n = vzalloc(size);
-	return n;
-}
-
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
-	struct vhost_memory mem, *newmem, *oldmem;
+	struct vhost_memory mem, *newmem;
+	struct vhost_memory_region *region;
+	struct vhost_umem_node *node;
+	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
 
@@ -752,24 +766,52 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		kvfree(newmem);
 		return -EFAULT;
 	}
-	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
-		vhost_memory_reg_sort_cmp, NULL);
 
-	if (!memory_access_ok(d, newmem, 0)) {
+	newumem = vhost_kvzalloc(sizeof(*newumem));
+	if (!newumem) {
 		kvfree(newmem);
-		return -EFAULT;
+		return -ENOMEM;
 	}
-	oldmem = d->memory;
-	d->memory = newmem;
+
+	newumem->umem_tree = RB_ROOT;
+	INIT_LIST_HEAD(&newumem->umem_list);
+
+	for (region = newmem->regions;
+	     region < newmem->regions + mem.nregions;
+	     region++) {
+		node = vhost_kvzalloc(sizeof(*node));
+		if (!node)
+			goto err;
+		node->start = region->guest_phys_addr;
+		node->size = region->memory_size;
+		node->last = node->start + node->size - 1;
+		node->userspace_addr = region->userspace_addr;
+		INIT_LIST_HEAD(&node->link);
+		list_add_tail(&node->link, &newumem->umem_list);
+		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
+	}
+
+	if (!memory_access_ok(d, newumem, 0))
+		goto err;
+
+	oldumem = d->umem;
+	d->umem = newumem;
 
 	/* All memory accesses are done under some VQ mutex. */
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
-		d->vqs[i]->memory = newmem;
+		d->vqs[i]->umem = newumem;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
-	kvfree(oldmem);
+
+	kvfree(newmem);
+	vhost_umem_clean(oldumem);
 	return 0;
+
+err:
+	vhost_umem_clean(newumem);
+	kvfree(newmem);
+	return -EFAULT;
 }
 
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
@@ -1072,28 +1114,6 @@ done:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
-static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
-{
-	const struct vhost_memory_region *reg;
-	int start = 0, end = mem->nregions;
-
-	while (start < end) {
-		int slot = start + (end - start) / 2;
-		reg = mem->regions + slot;
-		if (addr >= reg->guest_phys_addr)
-			end = slot;
-		else
-			start = slot + 1;
-	}
-
-	reg = mem->regions + start;
-	if (addr >= reg->guest_phys_addr &&
-		reg->guest_phys_addr + reg->memory_size > addr)
-		return reg;
-	return NULL;
-}
-
 /* TODO: This is really inefficient.  We need something like get_user()
  * (instruction directly accesses the data, with an exception table entry
  * returning -EFAULT). See Documentation/x86/exception-tables.txt.
@@ -1244,29 +1264,29 @@ EXPORT_SYMBOL_GPL(vhost_vq_init_access);
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			  struct iovec iov[], int iov_size)
 {
-	const struct vhost_memory_region *reg;
-	struct vhost_memory *mem;
+	const struct vhost_umem_node *node;
+	struct vhost_umem *umem = vq->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
 
-	mem = vq->memory;
 	while ((u64)len > s) {
 		u64 size;
 		if (unlikely(ret >= iov_size)) {
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
-		if (unlikely(!reg)) {
+		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							addr, addr + len - 1);
+		if (node == NULL || node->start > addr) {
 			ret = -EFAULT;
 			break;
 		}
 		_iov = iov + ret;
-		size = reg->memory_size - addr + reg->guest_phys_addr;
+		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(reg->userspace_addr + addr - reg->guest_phys_addr);
+			(node->userspace_addr + addr - node->start);
 		s += size;
 		addr += size;
 		++ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d36d8be..b93b6a0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -53,6 +53,25 @@ struct vhost_log {
 	u64 len;
 };
 
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->last)
+
+struct vhost_umem_node {
+	struct rb_node rb;
+	struct list_head link;
+	__u64 start;
+	__u64 last;
+	__u64 size;
+	__u64 userspace_addr;
+	__u64 flags_padding;
+	__u64 __subtree_last;
+};
+
+struct vhost_umem {
+	struct rb_root umem_tree;
+	struct list_head umem_list;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -101,7 +120,7 @@ struct vhost_virtqueue {
 	struct iovec *indirect;
 	struct vring_used_elem *heads;
 	/* Protected by virtqueue mutex. */
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 	void *private_data;
 	u64 acked_features;
 	/* Log write descriptors */
@@ -119,7 +138,6 @@ struct vhost_virtqueue {
 };
 
 struct vhost_dev {
-	struct vhost_memory *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
 	struct vhost_virtqueue **vqs;
@@ -129,14 +147,15 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct vhost_umem *umem;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
-struct vhost_memory *vhost_dev_reset_owner_prepare(void);
-void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_memory *);
+struct vhost_umem *vhost_dev_reset_owner_prepare(void);
+void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH V3 1/3] vhost: introduce vhost memory accessors
From: Jason Wang @ 2016-05-24  9:36 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1464082585-13049-1-git-send-email-jasowang@redhat.com>

This patch introduces vhost memory accessors which were just wrappers
for userspace address access helpers. This is a requirement for vhost
device iotlb implementation which will add iotlb translations in those
accessors.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   |  1 +
 drivers/vhost/vhost.c | 50 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb..beaeb17 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -985,6 +985,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vq->private_data = sock;
+		/* FIXME: iotlb prefetch here? */
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1..9f2a63a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -651,6 +651,22 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	return 1;
 }
 
+#define vhost_put_user(vq, x, ptr)  __put_user(x, ptr)
+
+static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
+			      const void *from, unsigned size)
+{
+	return __copy_to_user(to, from, size);
+}
+
+#define vhost_get_user(vq, x, ptr) __get_user(x, ptr)
+
+static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
+				void *from, unsigned size)
+{
+	return __copy_from_user(to, from, size);
+}
+
 static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
@@ -1156,7 +1172,8 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
 	void __user *used;
-	if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 0)
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+			   &vq->used->flags) < 0)
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		/* Make sure the flag is seen before log. */
@@ -1174,7 +1191,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-	if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), vhost_avail_event(vq)))
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+			   vhost_avail_event(vq)))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		void __user *used;
@@ -1212,7 +1230,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = __get_user(last_used_idx, &vq->used->idx);
+	r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
 	if (r)
 		goto err;
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1392,7 +1410,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
-	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+	if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
 		return -EFAULT;
@@ -1414,8 +1432,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(__get_user(ring_head,
-				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+	if (unlikely(vhost_get_user(vq, ring_head,
+		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1450,7 +1468,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
+					   sizeof desc);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -1538,15 +1557,15 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
 	if (count == 1) {
-		if (__put_user(heads[0].id, &used->id)) {
+		if (vhost_put_user(vq, heads[0].id, &used->id)) {
 			vq_err(vq, "Failed to write used id");
 			return -EFAULT;
 		}
-		if (__put_user(heads[0].len, &used->len)) {
+		if (vhost_put_user(vq, heads[0].len, &used->len)) {
 			vq_err(vq, "Failed to write used len");
 			return -EFAULT;
 		}
-	} else if (__copy_to_user(used, heads, count * sizeof *used)) {
+	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
@@ -1590,7 +1609,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 
 	/* Make sure buffer is written before we update index. */
 	smp_wmb();
-	if (__put_user(cpu_to_vhost16(vq, vq->last_used_idx), &vq->used->idx)) {
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			   &vq->used->idx)) {
 		vq_err(vq, "Failed to increment used idx");
 		return -EFAULT;
 	}
@@ -1622,7 +1642,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
+		if (vhost_get_user(vq, flags, &vq->avail->flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -1636,7 +1656,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (__get_user(event, vhost_used_event(vq))) {
+	if (vhost_get_user(vq, event, vhost_used_event(vq))) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -1678,7 +1698,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
 	if (r)
 		return false;
 
@@ -1713,7 +1733,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH V3 0/3] basic device IOTLB support
From: Jason Wang @ 2016-05-24  9:36 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace IOMMU implementation (qemu)
for a secure DMA environment (DMAR) in guest.

The idea is simple. When vhost meets an IOTLB miss, it will request
the assistance of userspace to do the translation, this is done
through:

- when there's a IOTLB miss, it will notify userspace through
  vhost_net fd and then userspace read the fault address, size and
  access from vhost fd.
- userspace write the translation result back to vhost fd, vhost can
  then update its IOTLB.

The codes were optimized for fixed mapping users e.g dpdk in guest. It
will be slow if dynamic mappings were used in guest. We could do
optimizations on top.

The codes were designed to be architecture independent. It should be
easily ported to any architecture.

Stress tested with l2fwd/vfio in guest with 4K/2M/1G page size. On 1G
hugepage case, 100% TLB hit rate were noticed.

Changes from V2:
- introduce memory accessors for vhost
- switch from ioctls to oridinary file read/write for iotlb miss and
  updating
- do not assume virtqueue were virtually mapped contiguously, all
  virtqueue access were done throug IOTLB
- verify memory access during IOTLB update and fail early
- introduce a module parameter for the size of IOTLB

Changes from V1:
- support any size/range of updating and invalidation through
  introducing the interval tree.
- convert from per device iotlb request to per virtqueue iotlb
  request, this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Jason Wang (3):
  vhost: introduce vhost memory accessors
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c        |  63 +++-
 drivers/vhost/vhost.c      | 760 ++++++++++++++++++++++++++++++++++++++-------
 drivers/vhost/vhost.h      |  60 +++-
 include/uapi/linux/vhost.h |  28 ++
 4 files changed, 790 insertions(+), 121 deletions(-)

-- 
2.7.4

^ permalink raw reply

* RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Li, Liang Z @ 2016-05-24  7:51 UTC (permalink / raw)
  To: Paolo Bonzini, mst@redhat.com
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	akpm@linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <3e47bf87-3e9d-f836-021b-8a90919f9002@redhat.com>

> On 20/05/2016 11:59, Liang Li wrote:
> > +
> > +		sg_init_table(sg, 5);
> > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> 
> These four should probably be placed in a single struct and therefore a single
> sg entry.  It might even be faster to place it together with the bitmap, thus
> avoiding the use of indirect descriptors.
> 

Yes, thanks for your suggestion.

> You should also test ballooning of a 64GB guest after filling in the page cache,
> not just ballooning of a freshly booted 4GB guest.  This will give you a much
> more sparse bitmap.  Still, the improvement in sending PFNs to the host are

I will include the test result for that case in next version.


Thanks,

Liang

> impressive.
> 
> Thanks,
> 
> Paolo
> 
> > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> > +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);

^ permalink raw reply

* RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Li, Liang Z @ 2016-05-24  7:48 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm@vger.kernel.org, mst@redhat.com, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	dgilbert@redhat.com
In-Reply-To: <20160520123243.59b91c00.cornelia.huck@de.ibm.com>

> On Fri, 20 May 2016 17:59:46 +0800
> Liang Li <liang.z.li@intel.com> wrote:
> 
> > The implementation of the current virtio-balloon is not very
> > efficient, Bellow is test result of time spends on inflating the
> > balloon to 3GB of a 4GB idle guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > 96ms) d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete. The
> > test shows that the bottle neck is the stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and do the madvise with a bulk
> > of pages, instead of the current page per page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the kernel side implementation which is intended to
> > speed up the inflating & deflating process by adding a new feature to
> > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> 
> Not commenting on the approach, but...
> 
> >
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  drivers/virtio/virtio_balloon.c     | 199
> ++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/virtio_balloon.h |   1 +
> >  mm/page_alloc.c                     |   6 ++
> >  3 files changed, 198 insertions(+), 8 deletions(-)
> >
> 
> >  static void tell_host(struct virtio_balloon *vb, struct virtqueue
> > *vq)  {
> > -	struct scatterlist sg;
> >  	unsigned int len;
> >
> > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +	if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +		u32 page_shift = PAGE_SHIFT;
> > +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > +		struct scatterlist sg[5];
> > +
> > +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG *
> sizeof(long);
> > +
> > +		sg_init_table(sg, 5);
> > +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > +		sg_set_buf(&sg[4], vb->page_bitmap +
> > +				 (start_pfn / BITS_PER_LONG), bmap_len);
> > +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> > +
> 
> ...you need to take care of the endianness of the data you put on the queue,
> otherwise virtio-1 on big endian won't work. (There's just been a patch for
> that problem.)

OK, thanks for your reminding.

Liang

^ permalink raw reply

* [PATCH 50/54] MAINTAINERS: Add file patterns for virtio device tree bindings
From: Geert Uytterhoeven @ 2016-05-22  9:06 UTC (permalink / raw)
  To: devicetree, linux-kernel
  Cc: virtualization, Geert Uytterhoeven, Michael S. Tsirkin
In-Reply-To: <1463907991-7916-1-git-send-email-geert@linux-m68k.org>

Submitters of device tree binding documentation may forget to CC
the subsystem maintainer if this is missing.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
---
Please apply this patch directly if you want to be involved in device
tree binding documentation for your subsystem.
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d79cedf604642ee4..28829bfa140c47b2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12168,6 +12168,7 @@ VIRTIO CORE, NET AND BLOCK DRIVERS
 M:	"Michael S. Tsirkin" <mst@redhat.com>
 L:	virtualization@lists.linux-foundation.org
 S:	Maintained
+F:	Documentation/devicetree/bindings/virtio/
 F:	drivers/virtio/
 F:	tools/virtio/
 F:	drivers/net/virtio_net.c
-- 
1.9.1

^ permalink raw reply related

* [PATCH v6 03/12] mm: balloon: use general non-lru movable page feature
From: Minchan Kim @ 2016-05-20 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rafael Aquini, Minchan Kim, linux-kernel, virtualization,
	linux-mm, Gioh Kim, Konstantin Khlebnikov
In-Reply-To: <1463754225-31311-1-git-send-email-minchan@kernel.org>

Now, VM has a feature to migrate non-lru movable pages so
balloon doesn't need custom migration hooks in migrate.c
and compaction.c. Instead, this patch implements
page->mapping->a_ops->{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general
migration functions and make balloon compaction simple.

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    | 54 +++++++++++++++++++---
 include/linux/balloon_compaction.h | 53 +++++++--------------
 include/uapi/linux/magic.h         |  1 +
 mm/balloon_compaction.c            | 94 +++++++-------------------------------
 mm/compaction.c                    |  7 ---
 mm/migrate.c                       | 19 +-------
 mm/vmscan.c                        |  2 +-
 7 files changed, 85 insertions(+), 145 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3a7150..88d5609375de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -488,8 +493,26 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	put_page(page); /* balloon reference */
 
-	return MIGRATEPAGE_SUCCESS;
+	return 0;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -519,9 +542,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
 
 	err = init_vqs(vb);
 	if (err)
@@ -531,13 +551,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
+
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -571,6 +611,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..c0c430d06a9b 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -45,9 +45,10 @@
 #define _LINUX_BALLOON_COMPACTION_H
 #include <linux/pagemap.h>
 #include <linux/page-flags.h>
-#include <linux/migrate.h>
+#include <linux/compaction.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -124,7 +100,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
 	__SetPageBalloon(page);
-	SetPagePrivate(page);
+	__SetPageMovable(page, balloon->inode->i_mapping);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -140,11 +116,14 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
+	__ClearPageMovable(page);
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	/*
+	 * No touch page.lru field once @page has been isolated
+	 * because VM is using the field.
+	 */
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 546b38886e11..d829ce63529d 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -80,5 +80,6 @@
 #define BPF_FS_MAGIC		0xcafe4a11
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..da91df50ba31 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,50 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
+
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
 
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
+	return balloon->migratepage(balloon, newpage, page, mode);
+}
 
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 
-	return rc;
-}
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index 2d6862d0df60..95877987b3c5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -792,13 +792,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			/*
 			 * __PageMovable can return false positive so we need
 			 * to verify it under page_lock.
diff --git a/mm/migrate.c b/mm/migrate.c
index 57559ca7c904..d1b3acf9b9a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -168,14 +168,12 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page))) {
-			balloon_page_putback(page);
 		/*
 		 * We isolated non-lru movable page so here we can use
 		 * __PageMovable because LRU page's mapping cannot have
 		 * PAGE_MAPPING_MOVABLE.
 		 */
-		} else if (unlikely(__PageMovable(page))) {
+		if (unlikely(__PageMovable(page))) {
 			VM_BUG_ON_PAGE(!PageIsolated(page), page);
 			lock_page(page);
 			if (PageMovable(page))
@@ -985,18 +983,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	if (unlikely(!is_lru)) {
 		rc = move_to_new_page(newpage, page, mode);
 		goto out_unlock_both;
@@ -1051,8 +1037,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage) ||
-				__PageMovable(newpage)))
+		if (unlikely(__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..93ba33789ac6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !__PageMovable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1

^ permalink raw reply related

* [PATCH v6 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-05-20 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
	Jonathan Corbet, Hugh Dickins, linux-kernel, dri-devel,
	virtualization, John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <1463754225-31311-1-git-send-email-minchan@kernel.org>

We have allowed migration for only LRU pages until now and it was
enough to make high-order pages. But recently, embedded system(e.g.,
webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
so we have seen several reports about troubles of small high-order
allocation. For fixing the problem, there were several efforts
(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
reserved memory, vmalloc and so on) but if there are lots of
non-movable pages in system, their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages
with movable. For the feature, this patch introduces functions related
to migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three functions
which are function pointers of struct address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the page
as PG_isolated so concurrent isolation in several CPUs skip the page
for isolation. If a driver cannot isolate the page, it should return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page.
The function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
migrated the oldpage successfully and returns 0.
If driver cannot migrate the page at the moment, driver can return -EAGAIN.
On -EAGAIN, VM will retry page migration in a short time because VM interprets
-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
VM will give up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed page.
In this function, driver should put the isolated page back to the own data
structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family functions
which will be called by VM. Exactly speaking, PG_movable is not a real flag of
struct page. Rather than, VM reuses page->mapping's lower bits to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly. Instead, driver should
use page_mapping which mask off the low two bits of page->mapping so it can get
right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.
As well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE
(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
page is LRU or non-lru movable once the page has been isolated. Because
LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more expensive
checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via __ClearMovablePage
under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated page
as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
movable page, it can skip it. Driver doesn't need to manipulate the flag
because VM will set/clear it automatically. Keep in mind that if driver
sees PG_isolated page, it means the page have been isolated by VM so it
shouldn't touch page.lru field.
PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
for own purpose.

Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking |   4 +
 Documentation/filesystems/vfs.txt |  11 +++
 Documentation/vm/page_migration   | 107 ++++++++++++++++++++-
 include/linux/compaction.h        |  17 ++++
 include/linux/fs.h                |   2 +
 include/linux/ksm.h               |   3 +-
 include/linux/migrate.h           |   2 +
 include/linux/mm.h                |   1 +
 include/linux/page-flags.h        |  33 +++++--
 mm/compaction.c                   |  82 ++++++++++++----
 mm/ksm.c                          |   4 +-
 mm/migrate.c                      | 191 ++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                   |   2 +-
 mm/util.c                         |   6 +-
 14 files changed, 411 insertions(+), 54 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 75eea7ce3d7c..dda6e3f8e203 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
+	void (*putback_page) (struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
+isolate_page:		yes
 migratepage:		yes (both)
+putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index c61a223ef3ff..900360cbcdae 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -592,9 +592,14 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
+	/* isolate a page for migration */
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
+	/* put migration-failed page back to right list */
+	void (*putback_page) (struct page *);
 	int (*launder_page) (struct page *);
+
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
@@ -747,6 +752,10 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
+  isolate_page: Called by the VM when isolating a movable non-lru page.
+	If page is successfully isolated, VM marks the page as PG_isolated
+	via __SetPageIsolated.
+
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
@@ -754,6 +763,8 @@ struct address_space_operations {
 	transfer any private data across and update any references
         that it has to the page.
 
+  putback_page: Called by the VM when isolated page's migration fails.
+
   launder_page: Called before freeing a page - it writes back the dirty page. To
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..80e98af46e95 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,110 @@ is increased so that the page cannot be freed while page migration occurs.
 20. The new page is moved to the LRU and can be scanned by the swapper
     etc again.
 
-Christoph Lameter, May 8, 2006.
+C. Non-LRU page migration
+-------------------------
+
+Although original migration aimed for reducing the latency of memory access
+for NUMA, compaction who want to create high-order page is also main customer.
+
+Current problem of the implementation is that it is designed to migrate only
+*LRU* pages. However, there are potential non-lru pages which can be migrated
+in drivers, for example, zsmalloc, virtio-balloon pages.
+
+For virtio-balloon pages, some parts of migration code path have been hooked
+up and added virtio-balloon specific functions to intercept migration logics.
+It's too specific to a driver so other drivers who want to make their pages
+movable would have to add own specific hooks in migration path.
+
+To overclome the problem, VM supports non-LRU page migration which provides
+generic functions for non-LRU movable pages without driver specific hooks
+migration path.
+
+If a driver want to make own pages movable, it should define three functions
+which are function pointers of struct address_space_operations.
+
+1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
+
+What VM expects on isolate_page function of driver is to return *true*
+if driver isolates page successfully. On returing true, VM marks the page
+as PG_isolated so concurrent isolation in several CPUs skip the page
+for isolation. If a driver cannot isolate the page, it should return *false*.
+
+Once page is successfully isolated, VM uses page.lru fields so driver
+shouldn't expect to preserve values in that fields.
+
+2. int (*migratepage) (struct address_space *mapping,
+		struct page *newpage, struct page *oldpage, enum migrate_mode);
+
+After isolation, VM calls migratepage of driver with isolated page.
+The function of migratepage is to move content of the old page to new page
+and set up fields of struct page newpage. Keep in mind that you should
+clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
+migrated the oldpage successfully and returns 0.
+If driver cannot migrate the page at the moment, driver can return -EAGAIN.
+On -EAGAIN, VM will retry page migration in a short time because VM interprets
+-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
+VM will give up the page migration without retrying in this time.
+
+Driver shouldn't touch page.lru field VM using in the functions.
+
+3. void (*putback_page)(struct page *);
+
+If migration fails on isolated page, VM should return the isolated page
+to the driver so VM calls driver's putback_page with migration failed page.
+In this function, driver should put the isolated page back to the own data
+structure.
 
+4. non-lru movable page flags
+
+There are two page flags for supporting non-lru movable page.
+
+* PG_movable
+
+Driver should use the below function to make page movable under page_lock.
+
+	void __SetPageMovable(struct page *page, struct address_space *mapping)
+
+It needs argument of address_space for registering migration family functions
+which will be called by VM. Exactly speaking, PG_movable is not a real flag of
+struct page. Rather than, VM reuses page->mapping's lower bits to represent it.
+
+	#define PAGE_MAPPING_MOVABLE 0x2
+	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
+
+so driver shouldn't access page->mapping directly. Instead, driver should
+use page_mapping which mask off the low two bits of page->mapping so it can get
+right struct address_space.
+
+For testing of non-lru movable page, VM supports __PageMovable function.
+However, it doesn't guarantee to identify non-lru movable page because
+page->mapping field is unified with other variables in struct page.
+As well, if driver releases the page after isolation by VM, page->mapping
+doesn't have stable value although it has PAGE_MAPPING_MOVABLE
+(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
+page is LRU or non-lru movable once the page has been isolated. Because
+LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
+good for just peeking to test non-lru movable pages before more expensive
+checking with lock_page in pfn scanning to select victim.
+
+For guaranteeing non-lru movable page, VM provides PageMovable function.
+Unlike __PageMovable, PageMovable functions validates page->mapping and
+mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
+destroying of page->mapping.
+
+Driver using __SetPageMovable should clear the flag via __ClearMovablePage
+under page_lock before the releasing the page.
+
+* PG_isolated
+
+To prevent concurrent isolation among several CPUs, VM marks isolated page
+as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
+movable page, it can skip it. Driver doesn't need to manipulate the flag
+because VM will set/clear it automatically. Keep in mind that if driver
+sees PG_isolated page, it means the page have been isolated by VM so it
+shouldn't touch page.lru field.
+PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
+for own purpose.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a58c852a268f..c6b47c861cea 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -54,6 +54,9 @@ enum compact_result {
 struct alloc_context; /* in mm/internal.h */
 
 #ifdef CONFIG_COMPACTION
+extern int PageMovable(struct page *page);
+extern void __SetPageMovable(struct page *page, struct address_space *mapping);
+extern void __ClearPageMovable(struct page *page);
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
@@ -151,6 +154,19 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
+static inline int PageMovable(struct page *page)
+{
+	return 0;
+}
+static inline void __SetPageMovable(struct page *page,
+			struct address_space *mapping)
+{
+}
+
+static inline void __ClearPageMovable(struct page *page)
+{
+}
+
 static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
 			const struct alloc_context *ac,
@@ -212,6 +228,7 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+struct node;
 extern int compaction_register_node(struct node *node);
 extern void compaction_unregister_node(struct node *node);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c9cc1f699dc1..6a2ce439ea42 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -402,6 +402,8 @@ struct address_space_operations {
 	 */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
+	bool (*isolate_page)(struct page *, isolate_mode_t);
+	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 7ae216a39c9e..481c8c4627ca 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -43,8 +43,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
-	page->mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9b50325e4ddf..404fbfefeb33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a00ec816233a..33eaec57e997 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1035,6 +1035,7 @@ static inline pgoff_t page_file_index(struct page *page)
 }
 
 bool page_mapped(struct page *page);
+struct address_space *page_mapping(struct page *page);
 
 /*
  * Return true only if the page has been allocated with
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..f8a2c4881608 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -129,6 +129,9 @@ enum pageflags {
 
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_private_2,
+
+	/* non-lru isolated movable page */
+	PG_isolated = PG_reclaim,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -357,29 +360,37 @@ PAGEFLAG(Idle, idle, PF_ANY)
  * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
  *
  * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled,
- * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit;
- * and then page->mapping points, not to an anon_vma, but to a private
+ * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON
+ * bit; and then page->mapping points, not to an anon_vma, but to a private
  * structure which KSM associates with that merged page.  See ksm.h.
  *
- * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable
+ * page and then page->mapping points a struct address_space.
  *
  * Please note that, confusingly, "page_mapping" refers to the inode
  * address_space which maps the page from disk; whereas "page_mapped"
  * refers to user virtual address space into which the page is mapped.
  */
-#define PAGE_MAPPING_ANON	1
-#define PAGE_MAPPING_KSM	2
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
+#define PAGE_MAPPING_ANON	0x1
+#define PAGE_MAPPING_MOVABLE	0x2
+#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
-static __always_inline int PageAnonHead(struct page *page)
+static __always_inline int PageMappingFlag(struct page *page)
 {
-	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0;
 }
 
 static __always_inline int PageAnon(struct page *page)
 {
 	page = compound_head(page);
-	return PageAnonHead(page);
+	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+}
+
+static __always_inline int __PageMovable(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
+				PAGE_MAPPING_MOVABLE;
 }
 
 #ifdef CONFIG_KSM
@@ -393,7 +404,7 @@ static __always_inline int PageKsm(struct page *page)
 {
 	page = compound_head(page);
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+				PAGE_MAPPING_KSM;
 }
 #else
 TESTPAGEFLAG_FALSE(Ksm)
@@ -641,6 +652,8 @@ static inline void __ClearPageBalloon(struct page *page)
 	atomic_set(&page->_mapcount, -1);
 }
 
+__PAGEFLAG(Isolated, isolated, PF_ANY);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/compaction.c b/mm/compaction.c
index 1427366ad673..2d6862d0df60 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -81,6 +81,41 @@ static inline bool migrate_async_suitable(int migratetype)
 
 #ifdef CONFIG_COMPACTION
 
+int PageMovable(struct page *page)
+{
+	struct address_space *mapping;
+
+	WARN_ON(!PageLocked(page));
+	if (!__PageMovable(page))
+		goto out;
+
+	mapping = page_mapping(page);
+	if (mapping && mapping->a_ops && mapping->a_ops->isolate_page)
+		return 1;
+out:
+	return 0;
+}
+EXPORT_SYMBOL(PageMovable);
+
+void __SetPageMovable(struct page *page, struct address_space *mapping)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
+	page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+}
+EXPORT_SYMBOL(__SetPageMovable);
+
+void __ClearPageMovable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!((unsigned long)page->mapping & PAGE_MAPPING_MOVABLE),
+				page);
+	page->mapping = (void *)((unsigned long)page->mapping &
+				PAGE_MAPPING_MOVABLE);
+}
+EXPORT_SYMBOL(__ClearPageMovable);
+
 /* Do not skip compaction more than 64 times */
 #define COMPACT_MAX_DEFER_SHIFT 6
 
@@ -735,21 +770,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		/*
-		 * Check may be lockless but that's ok as we recheck later.
-		 * It's possible to migrate LRU pages and balloon pages
-		 * Skip any other type of page
-		 */
-		is_lru = PageLRU(page);
-		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-		}
-
-		/*
 		 * Regardless of being on LRU, compound pages such as THP and
 		 * hugetlbfs are not to be compacted. We can potentially save
 		 * a lot of iterations if we skip them at once. The check is
@@ -765,8 +785,38 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 		}
 
-		if (!is_lru)
+		/*
+		 * Check may be lockless but that's ok as we recheck later.
+		 * It's possible to migrate LRU and non-lru movable pages.
+		 * Skip any other type of page
+		 */
+		is_lru = PageLRU(page);
+		if (!is_lru) {
+			if (unlikely(balloon_page_movable(page))) {
+				if (balloon_page_isolate(page)) {
+					/* Successfully isolated */
+					goto isolate_success;
+				}
+			}
+
+			/*
+			 * __PageMovable can return false positive so we need
+			 * to verify it under page_lock.
+			 */
+			if (unlikely(__PageMovable(page)) &&
+					!PageIsolated(page)) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+
+				if (isolate_movable_page(page, isolate_mode))
+					goto isolate_success;
+			}
+
 			goto isolate_fail;
+		}
 
 		/*
 		 * Migration will fail if an anonymous page is pinned in memory,
diff --git a/mm/ksm.c b/mm/ksm.c
index 4786b4150f62..35b8aef867a9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -532,8 +532,8 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 	void *expected_mapping;
 	unsigned long kpfn;
 
-	expected_mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	expected_mapping = (void *)((unsigned long)stable_node |
+					PAGE_MAPPING_KSM);
 again:
 	kpfn = READ_ONCE(stable_node->kpfn);
 	page = pfn_to_page(kpfn);
diff --git a/mm/migrate.c b/mm/migrate.c
index 2666f28b5236..57559ca7c904 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -31,6 +31,7 @@
 #include <linux/vmalloc.h>
 #include <linux/security.h>
 #include <linux/backing-dev.h>
+#include <linux/compaction.h>
 #include <linux/syscalls.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
@@ -73,6 +74,79 @@ int migrate_prep_local(void)
 	return 0;
 }
 
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Avoid burning cycles with pages that are yet under __free_pages(),
+	 * or just got freed under us.
+	 *
+	 * In case we 'win' a race for a movable page being freed under us and
+	 * raise its refcount preventing __free_pages() from doing its job
+	 * the put_page() at the end of this block will take care of
+	 * release this page, thus avoiding a nasty leakage.
+	 */
+	if (unlikely(!get_page_unless_zero(page)))
+		goto out;
+
+	/*
+	 * Check PageMovable before holding a PG_lock because page's owner
+	 * assumes anybody doesn't touch PG_lock of newly allocated page
+	 * so unconditionally grapping the lock ruins page's owner side.
+	 */
+	if (unlikely(!__PageMovable(page)))
+		goto out_putpage;
+	/*
+	 * As movable pages are not isolated from LRU lists, concurrent
+	 * compaction threads can race against page migration functions
+	 * as well as race against the releasing a page.
+	 *
+	 * In order to avoid having an already isolated movable page
+	 * being (wrongly) re-isolated while it is under migration,
+	 * or to avoid attempting to isolate pages being released,
+	 * lets be sure we have the page lock
+	 * before proceeding with the movable page isolation steps.
+	 */
+	if (unlikely(!trylock_page(page)))
+		goto out_putpage;
+
+	if (!PageMovable(page) || PageIsolated(page))
+		goto out_no_isolated;
+
+	mapping = page_mapping(page);
+	if (!mapping->a_ops->isolate_page(page, mode))
+		goto out_no_isolated;
+
+	/* Driver shouldn't use PG_isolated bit of page->flags */
+	WARN_ON_ONCE(PageIsolated(page));
+	__SetPageIsolated(page);
+	unlock_page(page);
+
+	return true;
+
+out_no_isolated:
+	unlock_page(page);
+out_putpage:
+	put_page(page);
+out:
+	return false;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	mapping = page_mapping(page);
+	mapping->a_ops->putback_page(page);
+	__ClearPageIsolated(page);
+}
+
 /*
  * Put previously isolated pages back onto the appropriate lists
  * from where they were once taken off for compaction/migration.
@@ -94,10 +168,25 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
+		if (unlikely(isolated_balloon_page(page))) {
 			balloon_page_putback(page);
-		else
+		/*
+		 * We isolated non-lru movable page so here we can use
+		 * __PageMovable because LRU page's mapping cannot have
+		 * PAGE_MAPPING_MOVABLE.
+		 */
+		} else if (unlikely(__PageMovable(page))) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		} else {
 			putback_lru_page(page);
+		}
 	}
 }
 
@@ -592,7 +681,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
  ***********************************************************/
 
 /*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
  * pages that do not use PagePrivate/PagePrivate2.
  *
  * Pages are locked upon entry and exit.
@@ -755,33 +844,69 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping;
-	int rc;
+	int rc = -EAGAIN;
+	bool is_lru = !__PageMovable(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	mapping = page_mapping(page);
-	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, mode);
-	else if (mapping->a_ops->migratepage)
-		/*
-		 * Most pages have a mapping and most filesystems provide a
-		 * migratepage callback. Anonymous pages are part of swap
-		 * space which also has its own migratepage callback. This
-		 * is the most common path for page migration.
-		 */
-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page, mode);
+	/*
+	 * In case of non-lru page, it could be released after
+	 * isolation step. In that case, we shouldn't try
+	 * fallback migration which is designed for LRU pages.
+	 */
+	if (unlikely(!is_lru)) {
+		VM_BUG_ON_PAGE(!PageIsolated(page), page);
+		if (!PageMovable(page)) {
+			rc = MIGRATEPAGE_SUCCESS;
+			__ClearPageIsolated(page);
+			goto out;
+		}
+	}
+
+	if (likely(is_lru)) {
+		if (!mapping)
+			rc = migrate_page(mapping, newpage, page, mode);
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * provide a migratepage callback. Anonymous pages
+			 * are part of swap space which also has its own
+			 * migratepage callback. This is the most common path
+			 * for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping, newpage,
+							page, mode);
+		else
+			rc = fallback_migrate_page(mapping, newpage,
+							page, mode);
+	} else {
+		rc = mapping->a_ops->migratepage(mapping, newpage,
+						page, mode);
+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+			!PageIsolated(page));
+	}
 
 	/*
 	 * When successful, old pagecache page->mapping must be cleared before
 	 * page is freed; but stats require that PageAnon be left as PageAnon.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (!PageAnon(page))
+		if (__PageMovable(page)) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+			/*
+			 * We clear PG_movable under page_lock so any compactor
+			 * cannot try to migrate this page.
+			 */
+			__ClearPageIsolated(page);
+		}
+
+		if (!((unsigned long)page->mapping & PAGE_MAPPING_FLAGS))
 			page->mapping = NULL;
 	}
+out:
 	return rc;
 }
 
@@ -791,6 +916,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
+	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
 		if (!force || mode == MIGRATE_ASYNC)
@@ -871,6 +997,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		goto out_unlock_both;
 	}
 
+	if (unlikely(!is_lru)) {
+		rc = move_to_new_page(newpage, page, mode);
+		goto out_unlock_both;
+	}
+
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -920,7 +1051,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(__is_movable_balloon_page(newpage) ||
+				__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
@@ -961,6 +1093,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
+		if (unlikely(__PageMovable(page))) {
+			lock_page(page);
+			if (!PageMovable(page))
+				__ClearPageIsolated(page);
+			unlock_page(page);
+		}
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
@@ -1010,8 +1148,21 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				num_poisoned_pages_inc();
 		}
 	} else {
-		if (rc != -EAGAIN)
-			putback_lru_page(page);
+		if (rc != -EAGAIN) {
+			if (likely(!__PageMovable(page))) {
+				putback_lru_page(page);
+				goto put_new;
+			}
+
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		}
+put_new:
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8f3bfc435ee..26868bbaecce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1008,7 +1008,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
-	if (PageAnonHead(page))
+	if (PageMappingFlag(page))
 		page->mapping = NULL;
 	if (check_free)
 		bad += free_pages_check(page);
diff --git a/mm/util.c b/mm/util.c
index 224d36e43a94..a04ccff7cc17 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -399,10 +399,12 @@ struct address_space *page_mapping(struct page *page)
 	}
 
 	mapping = page->mapping;
-	if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
+	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
-	return mapping;
+
+	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
+EXPORT_SYMBOL(page_mapping);
 
 /* Slow path of page_mapcount() for compound pages */
 int __page_mapcount(struct page *page)
-- 
1.9.1

^ permalink raw reply related

* [PATCH v6 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-05-20 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
	Chan Gyun Jeong, Jonathan Corbet, Hugh Dickins, linux-kernel,
	dri-devel, virtualization, John Einar Reitan, linux-mm,
	Chulmin Kim, Gioh Kim, Konstantin Khlebnikov, Sangseok Lee,
	Joonsoo Kim, Kyeongdon Kim, Naoya Horiguchi, Vlastimil Babka,
	Mel Gorman

Recently, I got many reports about perfermance degradation in embedded
system(Android mobile phone, webOS TV and so on) and easy fork fail.

The problem was fragmentation caused by zram and GPU driver mainly.
With memory pressure, their pages were spread out all of pageblock and
it cannot be migrated with current compaction algorithm which supports
only LRU pages. In the end, compaction cannot work well so reclaimer
shrinks all of working set pages. It made system very slow and even to
fail to fork easily which requires order-[2 or 3] allocations.

Other pain point is that they cannot use CMA memory space so when OOM
kill happens, I can see many free pages in CMA area, which is not
memory efficient. In our product which has big CMA memory, it reclaims
zones too exccessively to allocate GPU and zram page although there are
lots of free space in CMA so system becomes very slow easily.

To solve these problem, this patch tries to add facility to migrate
non-lru pages via introducing new functions and page flags to help
migration.

struct address_space_operations {
	..
	..
	bool (*isolate_page)(struct page *, isolate_mode_t);
	void (*putback_page)(struct page *);
	..
}

new page flags

	PG_movable
	PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".

Originally, Gioh Kim had tried to support this feature but he moved so
I took over the work. I took many code from his work and changed a little
bit and Konstantin Khlebnikov helped Gioh a lot so he should deserve to have
many credit, too.

And I should mention Chulmin who have tested this patchset heavily
so I can find many bugs from him. :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
  mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index

5. zsmalloc page migration
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

* From v5
  * rebase on next-20160520
  * move utility functions to compaction.c and export - Sergey
  * zsmalloc dobule free fix - Sergey
  * add additional Reviewed-by for zsmalloc - Sergey

* From v4
  * rebase on mmotm-2016-05-05-17-19
  * fix huge object migration - Chulmin
  * !CONFIG_COMPACTION support for zsmalloc

* From v3
  * rebase on mmotm-2016-04-06-20-40
  * fix swap_info deadlock - Chulmin
  * race without page_lock - Vlastimil
  * no use page._mapcount for potential user-mapped page driver - Vlastimil
  * fix and enhance doc/description - Vlastimil
  * use page->mapping lower bits to represent PG_movable
  * make driver side's rule simple.

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use each own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: dri-devel@lists.freedesktop.org
Cc: Hugh Dickins <hughd@google.com>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Chan Gyun Jeong <chan.jeong@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: Chulmin Kim <cmlaika.kim@samsung.com>

Minchan Kim (12):
  mm: use put_page to free page instead of putback_lru_page
  mm: migrate: support non-lru movable page migration
  mm: balloon: use general non-lru movable page feature
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking  |    4 +
 Documentation/filesystems/vfs.txt  |   11 +
 Documentation/vm/page_migration    |  107 ++-
 drivers/block/zram/zram_drv.c      |    6 +-
 drivers/virtio/virtio_balloon.c    |   54 +-
 include/linux/balloon_compaction.h |   53 +-
 include/linux/compaction.h         |   17 +
 include/linux/fs.h                 |    2 +
 include/linux/ksm.h                |    3 +-
 include/linux/migrate.h            |    2 +
 include/linux/mm.h                 |    1 +
 include/linux/page-flags.h         |   33 +-
 include/uapi/linux/magic.h         |    2 +
 mm/balloon_compaction.c            |   94 +--
 mm/compaction.c                    |   76 +-
 mm/ksm.c                           |    4 +-
 mm/migrate.c                       |  256 +++++--
 mm/page_alloc.c                    |    2 +-
 mm/util.c                          |    6 +-
 mm/vmscan.c                        |    2 +-
 mm/zsmalloc.c                      | 1351 +++++++++++++++++++++++++-----------
 21 files changed, 1475 insertions(+), 611 deletions(-)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Michael S. Tsirkin @ 2016-05-20 12:00 UTC (permalink / raw)
  To: Liang Li
  Cc: kvm, qemu-devel, linux-kernel, virtualization, amit.shah,
	pbonzini, akpm, dgilbert
In-Reply-To: <1463738386-30868-1-git-send-email-liang.z.li@intel.com>

On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> The implementation of the current virtio-balloon is not very efficient,
> Bellow is test result of time spends on inflating the balloon to 3GB of
> a 4GB idle guest:
> 
> a. allocating pages (6.5%, 103ms)
> b. sending PFNs to host (68.3%, 787ms)
> c. address translation (6.1%, 96ms)
> d. madvise (19%, 300ms)
> 
> It takes about 1577ms for the whole inflating process to complete. The
> test shows that the bottle neck is the stage b and stage d.
> 
> If using a bitmap to send the page info instead of the PFNs, we can
> reduce the overhead spends on stage b quite a lot. Furthermore, it's
> possible to do the address translation and do the madvise with a bulk
> of pages, instead of the current page per page way, so the overhead of
> stage c and stage d can also be reduced a lot.
> 
> This patch is the kernel side implementation which is intended to speed
> up the inflating & deflating process by adding a new feature to the
> virtio-balloon device. And now, inflating the balloon to 3GB of a 4GB
> idle guest only takes 175ms, it's about 9 times as fast as before.
> 
> TODO: optimize stage a by allocating/freeing a chunk of pages instead
> of a single page at a time.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 199 ++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/virtio_balloon.h |   1 +
>  mm/page_alloc.c                     |   6 ++
>  3 files changed, 198 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 7b6d74f..5330b6f 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -45,6 +45,8 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  
> +extern unsigned long get_max_pfn(void);
> +
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -62,6 +64,9 @@ struct virtio_balloon {
>  
>  	/* Number of balloon pages we've told the Host we're not using. */
>  	unsigned int num_pages;
> +	unsigned long *page_bitmap;
> +	unsigned long start_pfn, end_pfn;
> +	unsigned long bmap_len;
>  	/*
>  	 * The pages we've told the Host we're not using are enqueued
>  	 * at vb_dev_info->pages list.
> @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> +static int balloon_page_bitmap_init(struct virtio_balloon *vb)
> +{
> +	unsigned long max_pfn, bmap_bytes;
> +
> +	max_pfn = get_max_pfn();

This is racy. max_pfn could be increased by memory hotplug
after you got it.


> +	bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> +	if (!vb->page_bitmap)
> +		vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);

Likely to fail for a huge busy guest.
Why not init on device probe?
this way
	- probe will fail, or we can clear the feature bit
	- free memory is more likely to be available


> +	else {
> +		if (bmap_bytes <= vb->bmap_len)
> +			memset(vb->page_bitmap, 0, bmap_bytes);
> +		else {
> +			kfree(vb->page_bitmap);
> +			vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> +		}
> +	}
> +	if (!vb->page_bitmap) {
> +		dev_err(&vb->vdev->dev, "%s failure: allocate page bitmap\n",
> +			 __func__);
> +		return -ENOMEM;
> +	}
> +	vb->bmap_len = bmap_bytes;
> +	vb->start_pfn = max_pfn;
> +	vb->end_pfn = 0;
> +
> +	return 0;
> +}
> +

>  {
> -	struct scatterlist sg;
>  	unsigned int len;
>  
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> +		u32 page_shift = PAGE_SHIFT;
> +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> +		struct scatterlist sg[5];
> +
> +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> +
> +		sg_init_table(sg, 5);
> +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> +		sg_set_buf(&sg[4], vb->page_bitmap +
> +				 (start_pfn / BITS_PER_LONG), bmap_len);

This can be pre-initialized, correct?

> +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> +
> +	} else {
> +		struct scatterlist sg;
> +
> +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +		/* We should always be able to add one buffer to an
> +		* empty queue.
> +		*/
> +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	}
>  
> -	/* We should always be able to add one buffer to an empty queue. */
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>  	virtqueue_kick(vq);
>  
>  	/* When host has read buffer, this completes via balloon_ack */
> @@ -137,7 +193,21 @@ static void set_page_pfns(u32 pfns[], struct page *page)
>  		pfns[i] = page_to_balloon_pfn(page) + i;
>  }
>  
> -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> +static void set_page_bitmap(struct virtio_balloon *vb, struct page *page)
> +{
> +	unsigned int i;
> +	unsigned long *bitmap = vb->page_bitmap;
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> +		set_bit(balloon_pfn + i, bitmap);
> +	if (balloon_pfn < vb->start_pfn)
> +		vb->start_pfn = balloon_pfn;
> +	if (balloon_pfn > vb->end_pfn)
> +		vb->end_pfn = balloon_pfn;
> +}
> +
> +static unsigned fill_balloon_pfns(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> @@ -174,7 +244,104 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  	return num_allocated_pages;
>  }
>  
> -static void release_pages_balloon(struct virtio_balloon *vb)
> +static long fill_balloon_bitmap(struct virtio_balloon *vb, size_t num)
> +{
> +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> +	long num_allocated_pages = 0;
> +
> +	if (balloon_page_bitmap_init(vb) < 0)
> +		return num;
> +
> +	mutex_lock(&vb->balloon_lock);
> +	for (vb->num_pfns = 0; vb->num_pfns < num;
> +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +		struct page *page = balloon_page_enqueue(vb_dev_info);
> +
> +		if (!page) {
> +			dev_info_ratelimited(&vb->vdev->dev,
> +					     "Out of puff! Can't get %u pages\n",
> +					     VIRTIO_BALLOON_PAGES_PER_PAGE);
> +			/* Sleep for at least 1/5 of a second before retry. */
> +			msleep(200);
> +			break;
> +		}
> +		set_page_bitmap(vb, page);
> +		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		if (!virtio_has_feature(vb->vdev,
> +					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> +			adjust_managed_page_count(page, -1);
> +	}

This is grossly inefficient if you only requested a single page.
And it's also allocating memory very aggressively without ever telling
the host what is going on.
> +
> +	num_allocated_pages = vb->num_pfns;
> +	/* Did we get any? */
> +	if (vb->num_pfns != 0)
> +		tell_host(vb, vb->inflate_vq);
> +	mutex_unlock(&vb->balloon_lock);
> +
> +	return num_allocated_pages;
> +}
> +
> +static long fill_balloon(struct virtio_balloon *vb, size_t num)
> +{
> +	long num_allocated_pages;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP))
> +		num_allocated_pages = fill_balloon_bitmap(vb, num);
> +	else
> +		num_allocated_pages = fill_balloon_pfns(vb, num);
> +
> +	return num_allocated_pages;
> +}
> +
> +static void release_pages_balloon_bitmap(struct virtio_balloon *vb)
> +{
> +	unsigned long pfn, offset, size;
> +	struct page *page;
> +
> +	size = min(vb->bmap_len * BITS_PER_BYTE, vb->end_pfn);
> +	for (offset = vb->start_pfn; offset < size;
> +		 offset = pfn + VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +		pfn = find_next_bit(vb->page_bitmap, size, offset);
> +		if (pfn < size) {
> +			page = balloon_pfn_to_page(pfn);
> +			if (!virtio_has_feature(vb->vdev,
> +					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> +				adjust_managed_page_count(page, 1);
> +			put_page(page);
> +		}
> +	}
> +}
> +
> +static unsigned long leak_balloon_bitmap(struct virtio_balloon *vb, size_t num)
> +{
> +	unsigned long num_freed_pages = num;
> +	struct page *page;
> +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> +
> +	if (balloon_page_bitmap_init(vb) < 0)
> +		return num_freed_pages;
> +
> +	mutex_lock(&vb->balloon_lock);
> +	for (vb->num_pfns = 0; vb->num_pfns < num;
> +	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +		page = balloon_page_dequeue(vb_dev_info);
> +		if (!page)
> +			break;
> +		set_page_bitmap(vb, page);
> +		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> +	}
> +
> +	num_freed_pages = vb->num_pfns;
> +
> +	if (vb->num_pfns != 0)
> +		tell_host(vb, vb->deflate_vq);
> +	release_pages_balloon_bitmap(vb);
> +	mutex_unlock(&vb->balloon_lock);
> +
> +	return num_freed_pages;
> +}
> +
> +static void release_pages_balloon_pfns(struct virtio_balloon *vb)
>  {
>  	unsigned int i;
>  
> @@ -188,7 +355,7 @@ static void release_pages_balloon(struct virtio_balloon *vb)
>  	}
>  }
>  
> -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> +static unsigned leak_balloon_pfns(struct virtio_balloon *vb, size_t num)
>  {
>  	unsigned num_freed_pages;
>  	struct page *page;
> @@ -215,11 +382,23 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 */
>  	if (vb->num_pfns != 0)
>  		tell_host(vb, vb->deflate_vq);
> -	release_pages_balloon(vb);
> +	release_pages_balloon_pfns(vb);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
>  }
>  
> +static long leak_balloon(struct virtio_balloon *vb, size_t num)
> +{
> +	long num_freed_pages;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP))
> +		num_freed_pages = leak_balloon_bitmap(vb, num);
> +	else
> +		num_freed_pages = leak_balloon_pfns(vb, num);
> +
> +	return num_freed_pages;
> +}
> +
>  static inline void update_stat(struct virtio_balloon *vb, int idx,
>  			       u16 tag, u64 val)
>  {
> @@ -510,6 +689,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +	vb->page_bitmap = NULL;
> +	vb->bmap_len = 0;
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -567,6 +748,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	kfree(vb->page_bitmap);
>  	kfree(vb);
>  }
>  
> @@ -605,6 +787,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_PAGE_BITMAP,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..f78fa47 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c1069ef..74b2fc5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2139,6 +2139,12 @@ void drain_all_pages(struct zone *zone)
>  								zone, 1);
>  }
>  
> +unsigned long get_max_pfn(void)
> +{
> +	return max_pfn;
> +}
> +EXPORT_SYMBOL(get_max_pfn);
> +
>  #ifdef CONFIG_HIBERNATION
>  
>  void mark_free_pages(struct zone *zone)

Suggestion to address all above comments:
	1. allocate a bunch of pages and link them up,
	   calculating the min and the max pfn.
	   if max-min exceeds the allocated bitmap size,
	   tell host.
	2. limit allocated bitmap size to something reasonable.
	   How about 32Kbytes? This is 256kilo bit in the map, which comes
	   out to 1Giga bytes of memory in the balloon.
> -- 
> 1.9.1

^ permalink raw reply

* Re: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Paolo Bonzini @ 2016-05-20 11:19 UTC (permalink / raw)
  To: Liang Li, mst
  Cc: kvm, qemu-devel, linux-kernel, virtualization, amit.shah, akpm,
	dgilbert
In-Reply-To: <1463738386-30868-1-git-send-email-liang.z.li@intel.com>

On 20/05/2016 11:59, Liang Li wrote:
> +
> +		sg_init_table(sg, 5);
> +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));

These four should probably be placed in a single struct and therefore a
single sg entry.  It might even be faster to place it together with the
bitmap, thus avoiding the use of indirect descriptors.

You should also test ballooning of a 64GB guest after filling in the
page cache, not just ballooning of a freshly booted 4GB guest.  This
will give you a much more sparse bitmap.  Still, the improvement in
sending PFNs to the host are impressive.

Thanks,

Paolo

> +		sg_set_buf(&sg[4], vb->page_bitmap +
> +				 (start_pfn / BITS_PER_LONG), bmap_len);
> +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);

^ permalink raw reply

* Re: [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Cornelia Huck @ 2016-05-20 10:32 UTC (permalink / raw)
  To: Liang Li
  Cc: kvm, mst, qemu-devel, linux-kernel, virtualization, amit.shah,
	pbonzini, akpm, dgilbert
In-Reply-To: <1463738386-30868-1-git-send-email-liang.z.li@intel.com>

On Fri, 20 May 2016 17:59:46 +0800
Liang Li <liang.z.li@intel.com> wrote:

> The implementation of the current virtio-balloon is not very efficient,
> Bellow is test result of time spends on inflating the balloon to 3GB of
> a 4GB idle guest:
> 
> a. allocating pages (6.5%, 103ms)
> b. sending PFNs to host (68.3%, 787ms)
> c. address translation (6.1%, 96ms)
> d. madvise (19%, 300ms)
> 
> It takes about 1577ms for the whole inflating process to complete. The
> test shows that the bottle neck is the stage b and stage d.
> 
> If using a bitmap to send the page info instead of the PFNs, we can
> reduce the overhead spends on stage b quite a lot. Furthermore, it's
> possible to do the address translation and do the madvise with a bulk
> of pages, instead of the current page per page way, so the overhead of
> stage c and stage d can also be reduced a lot.
> 
> This patch is the kernel side implementation which is intended to speed
> up the inflating & deflating process by adding a new feature to the
> virtio-balloon device. And now, inflating the balloon to 3GB of a 4GB
> idle guest only takes 175ms, it's about 9 times as fast as before.
> 
> TODO: optimize stage a by allocating/freeing a chunk of pages instead
> of a single page at a time.

Not commenting on the approach, but...

> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 199 ++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/virtio_balloon.h |   1 +
>  mm/page_alloc.c                     |   6 ++
>  3 files changed, 198 insertions(+), 8 deletions(-)
> 

>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
> -	struct scatterlist sg;
>  	unsigned int len;
> 
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> +		u32 page_shift = PAGE_SHIFT;
> +		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> +		struct scatterlist sg[5];
> +
> +		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> +		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> +		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> +
> +		sg_init_table(sg, 5);
> +		sg_set_buf(&sg[0], &flags, sizeof(flags));
> +		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> +		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> +		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> +		sg_set_buf(&sg[4], vb->page_bitmap +
> +				 (start_pfn / BITS_PER_LONG), bmap_len);
> +		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> +

...you need to take care of the endianness of the data you put on the
queue, otherwise virtio-1 on big endian won't work. (There's just been
a patch for that problem.)

^ permalink raw reply

* [PATCH RFC kernel] balloon: speed up inflating/deflating process
From: Liang Li @ 2016-05-20  9:59 UTC (permalink / raw)
  To: mst
  Cc: kvm, qemu-devel, Liang Li, linux-kernel, virtualization,
	amit.shah, pbonzini, akpm, dgilbert

The implementation of the current virtio-balloon is not very efficient,
Bellow is test result of time spends on inflating the balloon to 3GB of
a 4GB idle guest:

a. allocating pages (6.5%, 103ms)
b. sending PFNs to host (68.3%, 787ms)
c. address translation (6.1%, 96ms)
d. madvise (19%, 300ms)

It takes about 1577ms for the whole inflating process to complete. The
test shows that the bottle neck is the stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we can
reduce the overhead spends on stage b quite a lot. Furthermore, it's
possible to do the address translation and do the madvise with a bulk
of pages, instead of the current page per page way, so the overhead of
stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to speed
up the inflating & deflating process by adding a new feature to the
virtio-balloon device. And now, inflating the balloon to 3GB of a 4GB
idle guest only takes 175ms, it's about 9 times as fast as before.

TODO: optimize stage a by allocating/freeing a chunk of pages instead
of a single page at a time.

Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 199 ++++++++++++++++++++++++++++++++++--
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/page_alloc.c                     |   6 ++
 3 files changed, 198 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f..5330b6f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -45,6 +45,8 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+extern unsigned long get_max_pfn(void);
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -62,6 +64,9 @@ struct virtio_balloon {
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+	unsigned long *page_bitmap;
+	unsigned long start_pfn, end_pfn;
+	unsigned long bmap_len;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+static int balloon_page_bitmap_init(struct virtio_balloon *vb)
+{
+	unsigned long max_pfn, bmap_bytes;
+
+	max_pfn = get_max_pfn();
+	bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+	if (!vb->page_bitmap)
+		vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
+	else {
+		if (bmap_bytes <= vb->bmap_len)
+			memset(vb->page_bitmap, 0, bmap_bytes);
+		else {
+			kfree(vb->page_bitmap);
+			vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
+		}
+	}
+	if (!vb->page_bitmap) {
+		dev_err(&vb->vdev->dev, "%s failure: allocate page bitmap\n",
+			 __func__);
+		return -ENOMEM;
+	}
+	vb->bmap_len = bmap_bytes;
+	vb->start_pfn = max_pfn;
+	vb->end_pfn = 0;
+
+	return 0;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
-	struct scatterlist sg;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+		u32 page_shift = PAGE_SHIFT;
+		unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
+		struct scatterlist sg[5];
+
+		start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
+		end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
+		bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
+
+		sg_init_table(sg, 5);
+		sg_set_buf(&sg[0], &flags, sizeof(flags));
+		sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
+		sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
+		sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
+		sg_set_buf(&sg[4], vb->page_bitmap +
+				 (start_pfn / BITS_PER_LONG), bmap_len);
+		virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
+
+	} else {
+		struct scatterlist sg;
+
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+		/* We should always be able to add one buffer to an
+		* empty queue.
+		*/
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
 	virtqueue_kick(vq);
 
 	/* When host has read buffer, this completes via balloon_ack */
@@ -137,7 +193,21 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 		pfns[i] = page_to_balloon_pfn(page) + i;
 }
 
-static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
+static void set_page_bitmap(struct virtio_balloon *vb, struct page *page)
+{
+	unsigned int i;
+	unsigned long *bitmap = vb->page_bitmap;
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
+		set_bit(balloon_pfn + i, bitmap);
+	if (balloon_pfn < vb->start_pfn)
+		vb->start_pfn = balloon_pfn;
+	if (balloon_pfn > vb->end_pfn)
+		vb->end_pfn = balloon_pfn;
+}
+
+static unsigned fill_balloon_pfns(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
@@ -174,7 +244,104 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static long fill_balloon_bitmap(struct virtio_balloon *vb, size_t num)
+{
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	long num_allocated_pages = 0;
+
+	if (balloon_page_bitmap_init(vb) < 0)
+		return num;
+
+	mutex_lock(&vb->balloon_lock);
+	for (vb->num_pfns = 0; vb->num_pfns < num;
+	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+		struct page *page = balloon_page_enqueue(vb_dev_info);
+
+		if (!page) {
+			dev_info_ratelimited(&vb->vdev->dev,
+					     "Out of puff! Can't get %u pages\n",
+					     VIRTIO_BALLOON_PAGES_PER_PAGE);
+			/* Sleep for at least 1/5 of a second before retry. */
+			msleep(200);
+			break;
+		}
+		set_page_bitmap(vb, page);
+		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
+		if (!virtio_has_feature(vb->vdev,
+					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+			adjust_managed_page_count(page, -1);
+	}
+
+	num_allocated_pages = vb->num_pfns;
+	/* Did we get any? */
+	if (vb->num_pfns != 0)
+		tell_host(vb, vb->inflate_vq);
+	mutex_unlock(&vb->balloon_lock);
+
+	return num_allocated_pages;
+}
+
+static long fill_balloon(struct virtio_balloon *vb, size_t num)
+{
+	long num_allocated_pages;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP))
+		num_allocated_pages = fill_balloon_bitmap(vb, num);
+	else
+		num_allocated_pages = fill_balloon_pfns(vb, num);
+
+	return num_allocated_pages;
+}
+
+static void release_pages_balloon_bitmap(struct virtio_balloon *vb)
+{
+	unsigned long pfn, offset, size;
+	struct page *page;
+
+	size = min(vb->bmap_len * BITS_PER_BYTE, vb->end_pfn);
+	for (offset = vb->start_pfn; offset < size;
+		 offset = pfn + VIRTIO_BALLOON_PAGES_PER_PAGE) {
+		pfn = find_next_bit(vb->page_bitmap, size, offset);
+		if (pfn < size) {
+			page = balloon_pfn_to_page(pfn);
+			if (!virtio_has_feature(vb->vdev,
+					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+				adjust_managed_page_count(page, 1);
+			put_page(page);
+		}
+	}
+}
+
+static unsigned long leak_balloon_bitmap(struct virtio_balloon *vb, size_t num)
+{
+	unsigned long num_freed_pages = num;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+
+	if (balloon_page_bitmap_init(vb) < 0)
+		return num_freed_pages;
+
+	mutex_lock(&vb->balloon_lock);
+	for (vb->num_pfns = 0; vb->num_pfns < num;
+	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+		set_page_bitmap(vb, page);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+	}
+
+	num_freed_pages = vb->num_pfns;
+
+	if (vb->num_pfns != 0)
+		tell_host(vb, vb->deflate_vq);
+	release_pages_balloon_bitmap(vb);
+	mutex_unlock(&vb->balloon_lock);
+
+	return num_freed_pages;
+}
+
+static void release_pages_balloon_pfns(struct virtio_balloon *vb)
 {
 	unsigned int i;
 
@@ -188,7 +355,7 @@ static void release_pages_balloon(struct virtio_balloon *vb)
 	}
 }
 
-static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned leak_balloon_pfns(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_freed_pages;
 	struct page *page;
@@ -215,11 +382,23 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon_pfns(vb);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+static long leak_balloon(struct virtio_balloon *vb, size_t num)
+{
+	long num_freed_pages;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP))
+		num_freed_pages = leak_balloon_bitmap(vb, num);
+	else
+		num_freed_pages = leak_balloon_pfns(vb, num);
+
+	return num_freed_pages;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -510,6 +689,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+	vb->page_bitmap = NULL;
+	vb->bmap_len = 0;
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -567,6 +748,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	kfree(vb->page_bitmap);
 	kfree(vb);
 }
 
@@ -605,6 +787,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_BITMAP,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..f78fa47 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1069ef..74b2fc5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2139,6 +2139,12 @@ void drain_all_pages(struct zone *zone)
 								zone, 1);
 }
 
+unsigned long get_max_pfn(void)
+{
+	return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 #ifdef CONFIG_HIBERNATION
 
 void mark_free_pages(struct zone *zone)
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] virtio_balloon: fix PFN format for virtio-1
From: Cornelia Huck @ 2016-05-18 13:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Christian Borntraeger, linux-kernel, virtualization
In-Reply-To: <1463574848-15630-1-git-send-email-mst@redhat.com>

On Wed, 18 May 2016 15:38:53 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Everything should be LE when using virtio-1, but
> the linux balloon driver does not seem to care about that.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Cornelia Huck <cornelia.huck@de.ibm.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)

Keeping the pfns in proper byte order seems less hacky than my
approach, and it fixes the crash for my setup as well.

Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>

^ permalink raw reply

* [PATCH] virtio_balloon: fix PFN format for virtio-1
From: Michael S. Tsirkin @ 2016-05-18 12:38 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization

Everything should be LE when using virtio-1, but
the linux balloon driver does not seem to care about that.

Cc: stable@vger.kernel.org
Reported-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f..476c0e3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -75,7 +75,7 @@ struct virtio_balloon {
 
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
-	u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
+	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
 
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
@@ -127,14 +127,16 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 }
 
-static void set_page_pfns(u32 pfns[], struct page *page)
+static void set_page_pfns(struct virtio_balloon *vb,
+			  __virtio32 pfns[], struct page *page)
 {
 	unsigned int i;
 
 	/* Set balloon pfns pointing at this page.
 	 * Note that the first pfn points at start of the page. */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
-		pfns[i] = page_to_balloon_pfn(page) + i;
+		pfns[i] = cpu_to_virtio32(vb->vdev,
+					  page_to_balloon_pfn(page) + i);
 }
 
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
@@ -158,7 +160,7 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb->pfns + vb->num_pfns, page);
+		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -177,10 +179,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 static void release_pages_balloon(struct virtio_balloon *vb)
 {
 	unsigned int i;
+	struct page *page;
 
 	/* Find pfns pointing at start of each page, get pages and free them. */
 	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = balloon_pfn_to_page(vb->pfns[i]);
+		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
+							   vb->pfns[i]));
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
@@ -203,7 +207,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb->pfns + vb->num_pfns, page);
+		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -471,13 +475,13 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
 	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb->pfns, newpage);
+	set_page_pfns(vb, vb->pfns, newpage);
 	tell_host(vb, vb->inflate_vq);
 
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
 	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb->pfns, page);
+	set_page_pfns(vb, vb->pfns, page);
 	tell_host(vb, vb->deflate_vq);
 
 	mutex_unlock(&vb->balloon_lock);
-- 
MST

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox