Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH 4.14 01/92] compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations
From: Greg Kroah-Hartman @ 2018-07-20 12:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea.parri, kstewart, linux-efi, brijesh.singh, Peter Zijlstra,
	jan.kiszka, will.deacon, jarkko.sakkinen, virtualization,
	yamada.masahiro, manojgupta, H. Peter Anvin, Thomas Gleixner,
	tweek, mawilcox, akataria, ghackmann, Ingo Molnar, mjg59, mka,
	geert, rientjes, aryabinin, thomas.lendacky, Arnd Bergmann,
	linux-kbuild, pombredanne, rostedt, acme, caoj.fnst, jpoimboe,
	sedat.dilek, boris.ostrovsky, Juergen
In-Reply-To: <20180720121417.206337808@linuxfoundation.org>

4.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Nick Desaulniers <ndesaulniers@google.com>

commit d03db2bc26f0e4a6849ad649a09c9c73fccdc656 upstream.

Functions marked extern inline do not emit an externally visible
function when the gnu89 C standard is used. Some KBUILD Makefiles
overwrite KBUILD_CFLAGS. This is an issue for GCC 5.1+ users as without
an explicit C standard specified, the default is gnu11. Since c99, the
semantics of extern inline have changed such that an externally visible
function is always emitted. This can lead to multiple definition errors
of extern inline functions at link time of compilation units whose build
files have removed an explicit C standard compiler flag for users of GCC
5.1+ or Clang.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@redhat.com
Cc: akataria@vmware.com
Cc: akpm@linux-foundation.org
Cc: andrea.parri@amarulasolutions.com
Cc: ard.biesheuvel@linaro.org
Cc: aryabinin@virtuozzo.com
Cc: astrachan@google.com
Cc: boris.ostrovsky@oracle.com
Cc: brijesh.singh@amd.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: geert@linux-m68k.org
Cc: ghackmann@google.com
Cc: gregkh@linuxfoundation.org
Cc: jan.kiszka@siemens.com
Cc: jarkko.sakkinen@linux.intel.com
Cc: jpoimboe@redhat.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: kstewart@linuxfoundation.org
Cc: linux-efi@vger.kernel.org
Cc: linux-kbuild@vger.kernel.org
Cc: manojgupta@google.com
Cc: mawilcox@microsoft.com
Cc: michal.lkml@markovi.net
Cc: mjg59@google.com
Cc: mka@chromium.org
Cc: pombredanne@nexb.com
Cc: rientjes@google.com
Cc: rostedt@goodmis.org
Cc: sedat.dilek@gmail.com
Cc: thomas.lendacky@amd.com
Cc: tstellar@redhat.com
Cc: tweek@google.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: yamada.masahiro@socionext.com
Link: http://lkml.kernel.org/r/20180621162324.36656-2-ndesaulniers@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 include/linux/compiler-gcc.h |   29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -66,25 +66,40 @@
 #endif
 
 /*
+ * Feature detection for gnu_inline (gnu89 extern inline semantics). Either
+ * __GNUC_STDC_INLINE__ is defined (not using gnu89 extern inline semantics,
+ * and we opt in to the gnu89 semantics), or __GNUC_STDC_INLINE__ is not
+ * defined so the gnu89 semantics are the default.
+ */
+#ifdef __GNUC_STDC_INLINE__
+# define __gnu_inline	__attribute__((gnu_inline))
+#else
+# define __gnu_inline
+#endif
+
+/*
  * Force always-inline if the user requests it so via the .config,
  * or if gcc is too old.
  * GCC does not warn about unused static inline functions for
  * -Wunused-function.  This turns out to avoid the need for complex #ifdef
  * directives.  Suppress the warning in clang as well by using "unused"
  * function attribute, which is redundant but not harmful for gcc.
+ * Prefer gnu_inline, so that extern inline functions do not emit an
+ * externally visible function. This makes extern inline behave as per gnu89
+ * semantics rather than c99. This prevents multiple symbol definition errors
+ * of extern inline functions at link time.
+ * A lot of inline functions can cause havoc with function tracing.
  */
 #if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) ||		\
     !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
-#define inline inline		__attribute__((always_inline,unused)) notrace
-#define __inline__ __inline__	__attribute__((always_inline,unused)) notrace
-#define __inline __inline	__attribute__((always_inline,unused)) notrace
+#define inline \
+	inline __attribute__((always_inline, unused)) notrace __gnu_inline
 #else
-/* A lot of inline functions can cause havoc with function tracing */
-#define inline inline		__attribute__((unused)) notrace
-#define __inline__ __inline__	__attribute__((unused)) notrace
-#define __inline __inline	__attribute__((unused)) notrace
+#define inline inline		__attribute__((unused)) notrace __gnu_inline
 #endif
 
+#define __inline__ inline
+#define __inline inline
 #define __always_inline	inline __attribute__((always_inline))
 #define  noinline	__attribute__((noinline))

^ permalink raw reply

* [PATCH v36 5/5] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>

The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c     | 10 ++++++++++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 82cd497..6340cc1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -814,6 +814,7 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -883,6 +884,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
 						  VIRTIO_BALLOON_CMD_ID_STOP);
 		spin_lock_init(&vb->free_page_list_lock);
 		INIT_LIST_HEAD(&vb->free_page_list);
+		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+				      poison_val, &poison_val);
+		}
 	}
 
 	err = virtio_balloon_register_shrinker(vb);
@@ -979,6 +985,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+	if (!page_poisoning_enabled())
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
 	__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
 	return 0;
 }
@@ -988,6 +997,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+	VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 18ee430..80a7b7e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
 	__u32 actual;
 	/* Free page report command id, readonly by guest */
 	__u32 free_page_report_cmd_id;
+	/* Stores PAGE_POISON if page poisoning is in use */
+	__u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

^ permalink raw reply related

* [PATCH v36 4/5] mm/page_poison: expose page_poisoning_enabled to kernel modules
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>

In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_poison.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
 	/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
 		(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
 		debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

^ permalink raw reply related

* [PATCH v36 3/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
obtained one by one from the mm free list via the regular allocation
function. The allocated pages are given back to mm after they are put onto
the vq.

Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register. When
the guest starts to report, it first sends a start cmd to host via the
free page vq, which acks to host the cmd id received. When the guest
finishes reporting free pages, a stop cmd is sent to host via the vq.

TODO:
- Add a batch page allocation API to amortize the allocation overhead.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c     | 331 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   4 +
 2 files changed, 307 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6fd406..82cd497 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,14 @@
 #define DEFAULT_BALLOON_PAGES_TO_SHRINK 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
+					     __GFP_NOMEMALLOC)
+/* The order of free page blocks to report to host */
+#define VIRTIO_BALLOON_FREE_PAGE_ORDER (MAX_ORDER - 1)
+/* The size of a free page block in bytes */
+#define VIRTIO_BALLOON_FREE_PAGE_SIZE \
+	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
+
 static unsigned long balloon_pages_to_shrink = DEFAULT_BALLOON_PAGES_TO_SHRINK;
 module_param(balloon_pages_to_shrink, ulong, 0600);
 MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory presure");
@@ -50,9 +58,22 @@ MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory presure");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+	VIRTIO_BALLOON_VQ_INFLATE,
+	VIRTIO_BALLOON_VQ_DEFLATE,
+	VIRTIO_BALLOON_VQ_STATS,
+	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -62,6 +83,16 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* The list of allocated free pages, waiting to be given back to mm */
+	struct list_head free_page_list;
+	spinlock_t free_page_list_lock;
+	/* The cmd id received from host */
+	u32 cmd_id_received;
+	/* The cmd id that is actively in use */
+	__virtio32 cmd_id_active;
+	/* Buffer to store the stop sign */
+	__virtio32 cmd_id_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -325,17 +356,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-	struct virtio_balloon *vb = vdev->priv;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vb->stop_update_lock, flags);
-	if (!vb->stop_update)
-		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -352,6 +372,52 @@ static inline s64 towards_target(struct virtio_balloon *vb)
 	return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long flags;
+	s64 diff = towards_target(vb);
+
+	if (diff) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(system_freezable_wq,
+				   &vb->update_balloon_size_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		virtio_cread(vdev, struct virtio_balloon_config,
+			     free_page_report_cmd_id, &vb->cmd_id_received);
+		if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
+		    vb->cmd_id_received !=
+				virtio32_to_cpu(vdev, vb->cmd_id_active)) {
+			spin_lock_irqsave(&vb->stop_update_lock, flags);
+			if (!vb->stop_update) {
+				queue_work(vb->balloon_wq,
+					   &vb->report_free_page_work);
+			}
+			spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+		}
+	}
+}
+
+static unsigned long return_free_pages_to_mm(struct virtio_balloon *vb)
+{
+	struct page *page;
+	unsigned long num = 0;
+
+	spin_lock_irq(&vb->free_page_list_lock);
+	while ((page = balloon_page_pop(&vb->free_page_list))) {
+		free_pages((unsigned long)page_address(page),
+			   VIRTIO_BALLOON_FREE_PAGE_ORDER);
+		num++;
+	}
+	spin_unlock_irq(&vb->free_page_list_lock);
+
+	return num;
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
 	u32 actual = vb->num_pages;
@@ -394,26 +460,44 @@ static void update_balloon_size_func(struct work_struct *work)
 
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+	const char *names[VIRTIO_BALLOON_VQ_MAX];
+	int err;
 
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * Inflateq and deflateq are used unconditionally. The names[]
+	 * will be NULL if the related feature is not enabled, which will
+	 * cause no allocation for the corresponding virtqueue in find_vqs.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
+	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
+	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
+	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
+	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
+	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
+		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
+		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
+					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
-	vb->inflate_vq = vqs[0];
-	vb->deflate_vq = vqs[1];
+	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
+	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
 		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
+		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
 
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
@@ -431,9 +515,145 @@ static int init_vqs(struct virtio_balloon *vb)
 		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
+
+	return 0;
+}
+
+static int send_cmd_id_start(struct virtio_balloon *vb)
+{
+	struct scatterlist sg;
+	struct virtqueue *vq = vb->free_page_vq;
+	int err, unused;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	vb->cmd_id_active = cpu_to_virtio32(vb->vdev, vb->cmd_id_received);
+	sg_init_one(&sg, &vb->cmd_id_active, sizeof(vb->cmd_id_active));
+	err = virtqueue_add_outbuf(vq, &sg, 1, &vb->cmd_id_active, GFP_KERNEL);
+	if (!err)
+		virtqueue_kick(vq);
+	return err;
+}
+
+static int send_cmd_id_stop(struct virtio_balloon *vb)
+{
+	struct scatterlist sg;
+	struct virtqueue *vq = vb->free_page_vq;
+	int err, unused;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	sg_init_one(&sg, &vb->cmd_id_stop, sizeof(vb->cmd_id_stop));
+	err = virtqueue_add_outbuf(vq, &sg, 1, &vb->cmd_id_stop, GFP_KERNEL);
+	if (!err)
+		virtqueue_kick(vq);
+	return err;
+}
+
+static int get_free_page_and_send(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct page *page;
+	struct scatterlist sg;
+	int err, unused;
+	void *p;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	page = alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
+			   VIRTIO_BALLOON_FREE_PAGE_ORDER);
+	/*
+	 * When the allocation returns NULL, it indicates that we have got all
+	 * the possible free pages, so return -EINTR to stop.
+	 */
+	if (!page)
+		return -EINTR;
+
+	p = page_address(page);
+	sg_init_one(&sg, p, VIRTIO_BALLOON_FREE_PAGE_SIZE);
+	/* There is always one entry reserved for the cmd id to use. */
+	if (vq->num_free > 1) {
+		err = virtqueue_add_inbuf(vq, &sg, 1, p, GFP_KERNEL);
+		if (unlikely(err)) {
+			free_pages((unsigned long)p,
+				   VIRTIO_BALLOON_FREE_PAGE_ORDER);
+			return err;
+		}
+		virtqueue_kick(vq);
+		spin_lock_irq(&vb->free_page_list_lock);
+		balloon_page_push(&vb->free_page_list, page);
+		spin_unlock_irq(&vb->free_page_list_lock);
+	} else {
+		/*
+		 * The vq has no available entry to add this page block, so
+		 * just free it.
+		 */
+		free_pages((unsigned long)p, VIRTIO_BALLOON_FREE_PAGE_ORDER);
+	}
+
 	return 0;
 }
 
+static int send_free_pages(struct virtio_balloon *vb)
+{
+	int err;
+	u32 cmd_id_active;
+
+	while (1) {
+		/*
+		 * If a stop id or a new cmd id was just received from host,
+		 * stop the reporting.
+		 */
+		cmd_id_active = virtio32_to_cpu(vb->vdev, vb->cmd_id_active);
+		if (cmd_id_active != vb->cmd_id_received)
+			break;
+
+		/*
+		 * The free page blocks are allocated and sent to host one by
+		 * one.
+		 */
+		err = get_free_page_and_send(vb);
+		if (err == -EINTR)
+			break;
+		else if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static void report_free_page_func(struct work_struct *work)
+{
+	int err;
+	struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
+						 report_free_page_work);
+
+	/* Start by sending the received cmd id to host with an outbuf. */
+	err = send_cmd_id_start(vb);
+	if (unlikely(err))
+		goto out_err;
+
+	err = send_free_pages(vb);
+	return_free_pages_to_mm(vb);
+	if (unlikely(err))
+		goto out_err;
+
+	/* End by sending a stop id to host with an outbuf. */
+	err = send_cmd_id_stop(vb);
+out_err:
+	if (err)
+		dev_err(&vb->vdev->dev, "%s: err = %d\n", __func__, err);
+}
+
 #ifdef CONFIG_BALLOON_COMPACTION
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
@@ -523,6 +743,16 @@ static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
 	struct virtio_balloon *vb = container_of(shrinker,
 					struct virtio_balloon, shrinker);
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		pages_freed = return_free_pages_to_mm(vb) <<
+			      VIRTIO_BALLOON_FREE_PAGE_ORDER;
+	}
+
+	if (pages_freed >= pages_to_free)
+		return pages_freed;
+
+	pages_to_free -= pages_freed;
+
 	/*
 	 * One invocation of leak_balloon can deflate at most
 	 * VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
@@ -530,8 +760,8 @@ static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
 	 * balloon_pages_to_shrink pages.
 	 */
 	while (vb->num_pages && pages_to_free) {
-		pages_to_free = balloon_pages_to_shrink - pages_freed;
 		pages_freed += leak_balloon(vb, pages_to_free);
+		pages_to_free -= pages_freed;
 	}
 	update_balloon_size(vb);
 
@@ -541,6 +771,7 @@ static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
 static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
 						   struct shrink_control *sc)
 {
+	unsigned long count;
 	struct virtio_balloon *vb = container_of(shrinker,
 					struct virtio_balloon, shrinker);
 
@@ -551,8 +782,18 @@ static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
 	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		return 0;
 
-	return min_t(unsigned long, vb->num_pages, balloon_pages_to_shrink) /
-	       VIRTIO_BALLOON_PAGES_PER_PAGE;
+	count = min_t(unsigned long, vb->num_pages, balloon_pages_to_shrink) /
+		VIRTIO_BALLOON_PAGES_PER_PAGE;
+
+	/*
+	 * Just add one block of free pages for the count estimation. We will
+	 * release all of them in shrinker_scan regardless of the count
+	 * returned here.
+	 */
+	if (!list_empty(&vb->free_page_list))
+		count += 1 << VIRTIO_BALLOON_FREE_PAGE_ORDER;
+
+	return count;
 }
 
 static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
@@ -615,13 +856,38 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		err = PTR_ERR(vb->vb_dev_info.inode);
 		kern_unmount(balloon_mnt);
 		vb->vb_dev_info.inode = NULL;
-		goto out_del_vqs;
+		goto out_del_balloon_wq;
 	}
 	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
 #endif
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		/*
+		 * There is always one entry reserved for cmd id, so the ring
+		 * size needs to be at least two to report free page hints.
+		 */
+		if (virtqueue_get_vring_size(vb->free_page_vq) < 2) {
+			err = -ENOSPC;
+			goto out_del_vqs;
+		}
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		if (!vb->balloon_wq) {
+			err = -ENOMEM;
+			goto out_del_vqs;
+		}
+		INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+		vb->cmd_id_received = VIRTIO_BALLOON_CMD_ID_STOP;
+		vb->cmd_id_active = cpu_to_virtio32(vb->vdev,
+						  VIRTIO_BALLOON_CMD_ID_STOP);
+		vb->cmd_id_stop = cpu_to_virtio32(vb->vdev,
+						  VIRTIO_BALLOON_CMD_ID_STOP);
+		spin_lock_init(&vb->free_page_list_lock);
+		INIT_LIST_HEAD(&vb->free_page_list);
+	}
+
 	err = virtio_balloon_register_shrinker(vb);
 	if (err)
-		goto out_del_vqs;
+		goto out_del_balloon_wq;
 
 	virtio_device_ready(vdev);
 
@@ -629,6 +895,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		virtballoon_changed(vdev);
 	return 0;
 
+out_del_balloon_wq:
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+		destroy_workqueue(vb->balloon_wq);
 out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
@@ -662,6 +931,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		cancel_work_sync(&vb->report_free_page_work);
+		destroy_workqueue(vb->balloon_wq);
+	}
+
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -713,6 +987,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 13b8cb5..18ee430 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,15 +34,19 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_CMD_ID_STOP	0
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Free page report command id, readonly by guest */
+	__u32 free_page_report_cmd_id;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

^ permalink raw reply related

* [PATCH v36 2/5] virtio_balloon: replace oom notifier with shrinker
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>

The OOM notifier is getting deprecated to use for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

In addition, the bug in the replaced virtballoon_oom_notify that only
VIRTIO_BALLOON_ARRAY_PFNS_MAX (i.e 256) balloon pages can be freed
though the user has specified more than that number is fixed in the
shrinker_scan function.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c | 113 +++++++++++++++++++++++-----------------
 1 file changed, 65 insertions(+), 48 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9356a1a..c6fd406 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/balloon_compaction.h>
-#include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
 #include <linux/mount.h>
@@ -40,12 +39,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define DEFAULT_BALLOON_PAGES_TO_SHRINK 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
+static unsigned long balloon_pages_to_shrink = DEFAULT_BALLOON_PAGES_TO_SHRINK;
+module_param(balloon_pages_to_shrink, ulong, 0600);
+MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory presure");
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -86,8 +85,8 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-	/* To register callback in oom notifier call chain */
-	struct notifier_block nb;
+	/* To register a shrinker to shrink memory upon memory pressure */
+	struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +364,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
 		      &actual);
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- *			    memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
-				  unsigned long dummy, void *parm)
-{
-	struct virtio_balloon *vb;
-	unsigned long *freed;
-	unsigned num_freed_pages;
-
-	vb = container_of(self, struct virtio_balloon, nb);
-	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-		return NOTIFY_OK;
-
-	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
-	update_balloon_size(vb);
-	*freed += num_freed_pages;
-
-	return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
 	struct virtio_balloon *vb;
@@ -548,6 +515,61 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+						  struct shrink_control *sc)
+{
+	unsigned long pages_to_free = balloon_pages_to_shrink,
+		      pages_freed = 0;
+	struct virtio_balloon *vb = container_of(shrinker,
+					struct virtio_balloon, shrinker);
+
+	/*
+	 * One invocation of leak_balloon can deflate at most
+	 * VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+	 * multiple times to deflate pages till reaching
+	 * balloon_pages_to_shrink pages.
+	 */
+	while (vb->num_pages && pages_to_free) {
+		pages_to_free = balloon_pages_to_shrink - pages_freed;
+		pages_freed += leak_balloon(vb, pages_to_free);
+	}
+	update_balloon_size(vb);
+
+	return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
+						   struct shrink_control *sc)
+{
+	struct virtio_balloon *vb = container_of(shrinker,
+					struct virtio_balloon, shrinker);
+
+	/*
+	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to handle the
+	 * case when shrinker needs to be invoked to relieve memory pressure.
+	 */
+	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+		return 0;
+
+	return min_t(unsigned long, vb->num_pages, balloon_pages_to_shrink) /
+	       VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
+{
+	unregister_shrinker(&vb->shrinker);
+}
+
+static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
+{
+	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
+	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
+	vb->shrinker.batch = 0;
+	vb->shrinker.seeks = DEFAULT_SEEKS;
+
+	return register_shrinker(&vb->shrinker);
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -580,17 +602,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
-	vb->nb.notifier_call = virtballoon_oom_notify;
-	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
-	err = register_oom_notifier(&vb->nb);
-	if (err < 0)
-		goto out_del_vqs;
-
 #ifdef CONFIG_BALLOON_COMPACTION
 	balloon_mnt = kern_mount(&balloon_fs);
 	if (IS_ERR(balloon_mnt)) {
 		err = PTR_ERR(balloon_mnt);
-		unregister_oom_notifier(&vb->nb);
 		goto out_del_vqs;
 	}
 
@@ -599,12 +614,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (IS_ERR(vb->vb_dev_info.inode)) {
 		err = PTR_ERR(vb->vb_dev_info.inode);
 		kern_unmount(balloon_mnt);
-		unregister_oom_notifier(&vb->nb);
 		vb->vb_dev_info.inode = NULL;
 		goto out_del_vqs;
 	}
 	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
 #endif
+	err = virtio_balloon_register_shrinker(vb);
+	if (err)
+		goto out_del_vqs;
 
 	virtio_device_ready(vdev);
 
@@ -637,7 +654,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = vdev->priv;
 
-	unregister_oom_notifier(&vb->nb);
+	virtio_balloon_unregister_shrinker(vb);
 
 	spin_lock_irq(&vb->stop_update_lock);
 	vb->stop_update = true;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v36 1/5] virtio-balloon: remove BUG() in init_vqs
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>

It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..9356a1a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
 		num_stats = update_balloon_stats(vb);
 
 		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
-		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		err = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+					   GFP_KERNEL);
+		if (err) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			return err;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
 	return 0;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v36 0/5] Virtio-balloon: support free page reporting
From: Wei Wang @ 2018-07-20  8:33 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm, torvalds
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
    Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    Guest: 8G RAM, 4 vCPU
    Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
    - Idle Guest Live Migration Time (results are averaged over 10 runs):
        - Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
	(setting page poisoning zero and enabling ksm don't affect the
         comparison result)
    - Guest with Linux Compilation Workload (make bzImage -j4):
        - Live Migration Time (average)
          Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
        - Linux Compilation Time
          Optimization v.s. Legacy = 5min4s v.s. 5min12s
          --> no obvious difference

ChangeLog:
v35->v36:
    - remove the mm patch, as Linus has a suggestion to get free page
      addresses via allocation, instead of reading from the free page
      list.
    - virtio-balloon:
        - replace oom notifier with shrinker;
        - the guest to host communication interface remains the same as
          v32.
	- allocate free page blocks and send to host one by one, and free
          them after sending all the pages.

For ChangeLogs from v22 to v35, please reference
https://lwn.net/Articles/759413/

For ChangeLogs before v21, please reference
https://lwn.net/Articles/743660/

Wei Wang (5):
  virtio-balloon: remove BUG() in init_vqs
  virtio_balloon: replace oom notifier with shrinker
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c     | 456 ++++++++++++++++++++++++++++++------
 include/uapi/linux/virtio_balloon.h |   7 +
 mm/page_poison.c                    |   6 +
 3 files changed, 394 insertions(+), 75 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [RFC 4/4] virtio: Add platform specific DMA API translation for virito devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
	khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

This adds a hook which a platform can define in order to allow it to
override virtio device's DMA OPS irrespective of whether it has the
flag VIRTIO_F_IOMMU_PLATFORM set or not. We want to use this to do
bounce-buffering of data on the new secure pSeries platform, currently
under development, where a KVM host cannot access all of the memory
space of a secure KVM guest.  The host can only access the pages which
the guest has explicitly requested to be shared with the host, thus
the virtio implementation in the guest has to copy data to and from
shared pages.

With this hook, the platform code in the secure guest can force the
use of swiotlb for virtio buffers, with a back-end for swiotlb which
will use a pool of pre-allocated shared pages.  Thus all data being
sent or received by virtio devices will be copied through pages which
the host has access to.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
 arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
 drivers/virtio/virtio.c                | 7 +++++++
 3 files changed, 19 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 8fa3945..bc5a9d3 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev);
 
 #endif /* __KERNEL__ */
 #endif	/* _ASM_DMA_MAPPING_H */
+
+#define platform_override_dma_ops platform_override_dma_ops
+
+struct virtio_device;
+
+extern void platform_override_dma_ops(struct virtio_device *vdev);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..5773bc7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/of.h>
 #include <linux/iommu.h>
 #include <linux/rculist.h>
+#include <linux/virtio.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/rtas.h>
@@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str)
 __setup("multitce=", disable_multitce);
 
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+void platform_override_dma_ops(struct virtio_device *vdev)
+{
+	/* Override vdev->parent.dma_ops if required */
+}
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 6b13987..432c332 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status);
 
 const struct dma_map_ops virtio_direct_dma_ops;
 
+#ifndef platform_override_dma_ops
+static inline void platform_override_dma_ops(struct virtio_device *vdev)
+{
+}
+#endif
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
 	int ret = dev->config->finalize_features(dev);
@@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev)
 	if (virtio_has_iommu_quirk(dev))
 		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
 
+	platform_override_dma_ops(dev);
 	if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
 		return 0;
 
-- 
2.9.3

^ permalink raw reply related

* [RFC 3/4] virtio: Force virtio core to use DMA API callbacks for all virtio devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
	khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Virtio core should use DMA API callbacks for all virtio devices which may
generate either GAP or IOVA depending on VIRTIO_F_IOMMU_PLATFORM flag and
resulting QEMU expectations. This implies that every virtio device needs
to have a DMA OPS structure. This removes previous GPA fallback code paths.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio_ring.c | 65 ++------------------------------------------
 1 file changed, 2 insertions(+), 63 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 814b395..c265964 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,26 +141,6 @@ struct vring_virtqueue {
  * unconditionally on data path.
  */
 
-static bool vring_use_dma_api(struct virtio_device *vdev)
-{
-	if (!virtio_has_iommu_quirk(vdev))
-		return true;
-
-	/* Otherwise, we are left to guess. */
-	/*
-	 * In theory, it's possible to have a buggy QEMU-supposed
-	 * emulated Q35 IOMMU and Xen enabled at the same time.  On
-	 * such a configuration, virtio has never worked and will
-	 * not work without an even larger kludge.  Instead, enable
-	 * the DMA API if we're a Xen guest, which at least allows
-	 * all of the sensible Xen configurations to work correctly.
-	 */
-	if (xen_domain())
-		return true;
-
-	return false;
-}
-
 /*
  * The DMA ops on various arches are rather gnarly right now, and
  * making all of the arch DMA ops work on the vring device itself
@@ -176,9 +156,6 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
 				   struct scatterlist *sg,
 				   enum dma_data_direction direction)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return (dma_addr_t)sg_phys(sg);
-
 	/*
 	 * We can't use dma_map_sg, because we don't use scatterlists in
 	 * the way it expects (we don't guarantee that the scatterlist
@@ -193,9 +170,6 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
 				   void *cpu_addr, size_t size,
 				   enum dma_data_direction direction)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return (dma_addr_t)virt_to_phys(cpu_addr);
-
 	return dma_map_single(vring_dma_dev(vq),
 			      cpu_addr, size, direction);
 }
@@ -205,9 +179,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 {
 	u16 flags;
 
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return;
-
 	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
 	if (flags & VRING_DESC_F_INDIRECT) {
@@ -228,9 +199,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 static int vring_mapping_error(const struct vring_virtqueue *vq,
 			       dma_addr_t addr)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return 0;
-
 	return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
@@ -1016,43 +984,14 @@ EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
 static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
 			      dma_addr_t *dma_handle, gfp_t flag)
 {
-	if (vring_use_dma_api(vdev)) {
-		return dma_alloc_coherent(vdev->dev.parent, size,
+	return dma_alloc_coherent(vdev->dev.parent, size,
 					  dma_handle, flag);
-	} else {
-		void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag);
-		if (queue) {
-			phys_addr_t phys_addr = virt_to_phys(queue);
-			*dma_handle = (dma_addr_t)phys_addr;
-
-			/*
-			 * Sanity check: make sure we dind't truncate
-			 * the address.  The only arches I can find that
-			 * have 64-bit phys_addr_t but 32-bit dma_addr_t
-			 * are certain non-highmem MIPS and x86
-			 * configurations, but these configurations
-			 * should never allocate physical pages above 32
-			 * bits, so this is fine.  Just in case, throw a
-			 * warning and abort if we end up with an
-			 * unrepresentable address.
-			 */
-			if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
-				free_pages_exact(queue, PAGE_ALIGN(size));
-				return NULL;
-			}
-		}
-		return queue;
-	}
 }
 
 static void vring_free_queue(struct virtio_device *vdev, size_t size,
 			     void *queue, dma_addr_t dma_handle)
 {
-	if (vring_use_dma_api(vdev)) {
-		dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
-	} else {
-		free_pages_exact(queue, PAGE_ALIGN(size));
-	}
+	dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
 }
 
 struct virtqueue *vring_create_virtqueue(
-- 
2.9.3

^ permalink raw reply related

* [RFC 2/4] virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
	khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Now that virtio core always needs all virtio devices to have DMA OPS, we
need to make sure that the structure it points is the right one. In the
absence of VIRTIO_F_IOMMU_PLATFORM flag QEMU expects GPA from guest kernel.
In such case, virtio device must use default virtio_direct_dma_ops DMA OPS
structure which transforms scatter gather buffer addresses as GPA. This
DMA OPS override must happen as early as possible during virtio device
initializatin sequence before virtio core starts using given device's DMA
OPS callbacks for I/O transactions. This change detects device's IOMMU flag
and does the override in case the flag is cleared.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7907ad3..6b13987 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -166,6 +166,8 @@ void virtio_add_status(struct virtio_device *dev, unsigned int status)
 }
 EXPORT_SYMBOL_GPL(virtio_add_status);
 
+const struct dma_map_ops virtio_direct_dma_ops;
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
 	int ret = dev->config->finalize_features(dev);
@@ -174,6 +176,9 @@ int virtio_finalize_features(struct virtio_device *dev)
 	if (ret)
 		return ret;
 
+	if (virtio_has_iommu_quirk(dev))
+		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+
 	if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
 		return 0;
 
-- 
2.9.3

^ permalink raw reply related

* [RFC 1/4] virtio: Define virtio_direct_dma_ops structure
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
	khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Current implementation of DMA API inside virtio core calls device's DMA OPS
callback functions when the flag VIRTIO_F_IOMMU_PLATFORM flag is set. But
in absence of the flag, virtio core falls back calling basic transformation
of the incoming SG addresses as GPA. Going forward virtio should only call
DMA API based transformations generating either GPA or IOVA depending on
QEMU expectations again based on VIRTIO_F_IOMMU_PLATFORM flag. It requires
removing existing fallback code path for GPA transformation and replacing
that with a direct map DMA OPS structure. This adds that direct mapping DMA
OPS structure to be used in later patches which will make virtio core call
DMA API all the time for all virtio devices.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c            | 60 ++++++++++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.h |  3 ++
 2 files changed, 63 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 59e36ef..7907ad3 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -3,6 +3,7 @@
 #include <linux/virtio_config.h>
 #include <linux/module.h>
 #include <linux/idr.h>
+#include <linux/dma-mapping.h>
 #include <uapi/linux/virtio_ids.h>
 
 /* Unique numbering for virtio devices. */
@@ -442,3 +443,62 @@ core_initcall(virtio_init);
 module_exit(virtio_exit);
 
 MODULE_LICENSE("GPL");
+
+/*
+ * Virtio direct mapping DMA API operations structure
+ *
+ * This defines DMA API structure for all virtio devices which would not
+ * either bring in their own DMA OPS from architecture or they would not
+ * like to use architecture specific IOMMU based DMA OPS because QEMU
+ * expects GPA instead of an IOVA in absence of VIRTIO_F_IOMMU_PLATFORM.
+ */
+dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
+			    unsigned long offset, size_t size,
+			    enum dma_data_direction dir,
+			    unsigned long attrs)
+{
+	return page_to_phys(page) + offset;
+}
+
+void virtio_direct_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+			size_t size, enum dma_data_direction dir,
+			unsigned long attrs)
+{
+}
+
+int virtio_direct_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
+{
+	return 0;
+}
+
+void *virtio_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+		gfp_t gfp, unsigned long attrs)
+{
+	void *queue = alloc_pages_exact(PAGE_ALIGN(size), gfp);
+
+	if (queue) {
+		phys_addr_t phys_addr = virt_to_phys(queue);
+		*dma_handle = (dma_addr_t)phys_addr;
+
+		if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
+			free_pages_exact(queue, PAGE_ALIGN(size));
+			return NULL;
+		}
+	}
+	return queue;
+}
+
+void virtio_direct_free(struct device *dev, size_t size, void *vaddr,
+		dma_addr_t dma_addr, unsigned long attrs)
+{
+	free_pages_exact(vaddr, PAGE_ALIGN(size));
+}
+
+const struct dma_map_ops virtio_direct_dma_ops = {
+	.alloc			= virtio_direct_alloc,
+	.free			= virtio_direct_free,
+	.map_page		= virtio_direct_map_page,
+	.unmap_page		= virtio_direct_unmap_page,
+	.mapping_error		= virtio_direct_mapping_error,
+};
+EXPORT_SYMBOL(virtio_direct_dma_ops);
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 135ee3c..ec44d2f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -31,6 +31,9 @@
 #include <linux/highmem.h>
 #include <linux/spinlock.h>
 
+extern struct dma_map_ops virtio_direct_dma_ops;
+
+
 struct virtio_pci_vq_info {
 	/* the actual virtqueue */
 	struct virtqueue *vq;
-- 
2.9.3

^ permalink raw reply related

* [RFC 0/4] Virtio uses DMA API for all devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
	khandual, linuxppc-dev, elfring, haren, david

This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virito devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.

First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks to do the required (as discussed previously
in the above mentioned thread how host is allowed to access only parts of
the guest GPA range) bounce buffering into the shared memory for all I/O
scatter gather buffers to be consumed on the host side.

Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.

Anshuman Khandual (4):
  virtio: Define virtio_direct_dma_ops structure
  virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
  virtio: Force virtio core to use DMA API callbacks for all virtio devices
  virtio: Add platform specific DMA API translation for virito devices

 arch/powerpc/include/asm/dma-mapping.h |  6 +++
 arch/powerpc/platforms/pseries/iommu.c |  6 +++
 drivers/virtio/virtio.c                | 72 ++++++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.h     |  3 ++
 drivers/virtio/virtio_ring.c           | 65 +-----------------------------
 5 files changed, 89 insertions(+), 63 deletions(-)

-- 
2.9.3

^ permalink raw reply

* Re: Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
From: Konrad Rzeszutek Wilk @ 2018-07-20  2:45 UTC (permalink / raw)
  To: Ahmed Abd El Mawgood, xen-devel
  Cc: nathan Corbet, Ard Biesheuvel, rkrcmar, Kees Cook, kvm, linux-doc,
	David Vrabel, x86, Boris Lukashev, virtualization, Ingo Molnar,
	nigel.edwards, hpa, Kernel Hardening, Paolo Bonzini,
	Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-1-ahmedsoliman0x666@gmail.com>

On Thu, Jul 19, 2018 at 11:37:59PM +0200, Ahmed Abd El Mawgood wrote:
> Hi,
> 
> This is my first set of patches that works as I would expect, and the
> third revision I sent to mailing lists.
> 
> Following up with my previous discussions about kernel rootkit mitigation
> via placing R/O protection on critical data structure, static data,
> privileged registers with static content. These patches present the
> first part where it is only possible to place these protections on
> memory pages. Feature-wise, this set of patches is incomplete in the sense of:
> - They still don't protect privileged registers
> - They don't protect guest TLB from malicious gva -> gpa page mappings.
> But they provide sketches for a basic working design. Note that I am totally
> noob and it took lots of time and effort to get to this point. So sorry in
> advance if I overlooked something.

This reminds me of Xen PV page model. That is the hypervisor is the one
auditing the page tables and the guest's pages are read-only.

Ditto for IDT, GDT, etc. Gosh, did you by chance look at how
Xen PV mechanism is done? It may provide the protection you are looking for?

CC-ing xen-devel.
> 
> [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
> [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot itterator functions
> [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
> 
> Summery:
> 
>  Documentation/virtual/kvm/hypercalls.txt |  14 ++++
>  arch/x86/include/asm/kvm_host.h          |  11 ++-
>  arch/x86/kvm/Kconfig                     |   7 ++
>  arch/x86/kvm/mmu.c                       | 127 ++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c                       |  82 +++++++++++++++++++-
>  include/linux/kvm_host.h                 |   3 +
>  include/uapi/linux/kvm_para.h            |   1 +
>  virt/kvm/kvm_main.c                      |  29 ++++++-
>  8 files changed, 232 insertions(+), 42 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
From: Randy Dunlap @ 2018-07-20  1:11 UTC (permalink / raw)
  To: Ahmed Abd El Mawgood, kvm, Kernel Hardening, virtualization,
	linux-doc, x86
  Cc: Ard Biesheuvel, Kees Cook, nathan Corbet, David Vrabel, rkrcmar,
	Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
	Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-2-ahmedsoliman0x666@gmail.com>

On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:

>  Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
> index a890529c63ed..a9db68adb7c9 100644
> --- a/Documentation/virtual/kvm/hypercalls.txt
> +++ b/Documentation/virtual/kvm/hypercalls.txt
> @@ -121,3 +121,17 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
>  
>  Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
>  or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
> +
> +7. KVM_HC_HMROE
> +----------------
> +Architecture: x86
> +Status: active
> +Purpose: Hypercall used to apply Read-Only Enforcement to guest pages
> +Usage:
> +     a0: start address of page that should be protected.

Is this done one page per call?  No grouping, no multiple pages?

> +
> +This hypercall lets a guest kernel to have part of its read/write memory

                  lets a guest kernel have part of

> +converted into read-only.  This action is irreversible. KVM_HC_HMROE can
> +not be triggered from guest Ring 3 (user mode). The reason is that user
> +mode malicious software can make use of it enforce read only protection on

                               make use of it to enforce

> +an arbitrary memory page thus crashing the kernel.
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
From: Randy Dunlap @ 2018-07-20  1:07 UTC (permalink / raw)
  To: Ahmed Abd El Mawgood, kvm, Kernel Hardening, virtualization,
	linux-doc, x86
  Cc: Ard Biesheuvel, Kees Cook, nathan Corbet, David Vrabel, rkrcmar,
	Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
	Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-4-ahmedsoliman0x666@gmail.com>

On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:
> This patch introduces a hypercall implemented for X86 that can assist
> against subset of kernel rootkits, it works by place readonly protection in
> shadow PTE. The end result protection is also kept in a bitmap for each
> kvm_memory_slot and is used as reference when updating SPTEs. The whole
> goal is to protect the guest kernel static data from modification if
> attacker is running from guest ring 0, for this reason there is no
> hypercall to revert effect of Memory ROE hypercall. This patch doesn't
> implement integrity check on guest TLB so obvious attack on the current
> implementation will involve guest virtual address -> guest physical
> address remapping, but there are plans to fix that.
> 
> Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>
> ---

> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 92fd433c50b9..8ae822a8dc7a 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>  	 auditing of KVM MMU events at runtime.
>  
> +config KVM_MROE
> +	bool "Hypercall Memory Read-Only Enforcement"
> +	depends on KVM && X86
> +	help
> +	This option add KVM_HC_HMROE hypercall to kvm which as hardening

	            adds                       to kvm as a hardening   (???)


> +	mechanism to protect memory pages from being edited.
> +
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
>  source drivers/vhost/Kconfig


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
From: Ahmed Soliman @ 2018-07-20  0:26 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jonathan Corbet, Ard Biesheuvel, Radim Krčmář,
	Kees Cook, kvm, linux-doc, David Vrabel, the arch/x86 maintainers,
	Boris Lukashev, virtualization, Ingo Molnar, nigel.edwards,
	H . Peter Anvin, Kernel Hardening, Paolo Bonzini, Thomas Gleixner,
	Rik van Riel
In-Reply-To: <CAG48ez3EyU=ROBczUdHEuOYBtZghYqOpq3K16Bs4RQLO1OO6oA@mail.gmail.com>

On 20 July 2018 at 00:59, Jann Horn <jannh@google.com> wrote:
> On Thu, Jul 19, 2018 at 11:40 PM Ahmed Abd El Mawgood

> Why are you implementing this in the kernel, instead of doing it in
> host userspace?

I thought about implementing it completely in QEMU but It won't be
possible for few reasons:

- After talking to QEMU folks I came up to conclusion that it when it
 comes to managing memory allocated for guest, it is always better to let
 KVM handles everything, unless there is a good reason to play with that
 memory chunk inside QEMU itself.
- But actually there is a good reason for implementing ROE in kernel space,
 it is that ROE is architecture dependent to great extent. I should have
 emphasized that the only currently supported architecture is X86. I am
 not sure how deep the dependency on architecture goes. But as for now
 the current set of patches does a SPTE enumeration as part of the process.
 To my best knowledge, this isn't exposed outside arch/x68/kvm let alone
 having a host user space interface for it. Also the way I am planning to
 protect TLB from malicious gva -> gpa mapping is by knowing that in x86
 it is possible to VMEXIT on page faults, I am not sure if it will safe to
 assume that all kvm supported architectures will behave this way.

For these reasons I thought it will be better if arch dependent stuff (the
mechanism implementation) is kept in arch/*/kvm folder and with minimal
modifications to virt/kvm/* after setting a kconfig variable to enable ROE.
But I left room for the user space app using kvm to decide the rightful policy
for handling ROE violations. The way it works by KVM_EXIT_MMIO error to user
space, keeping all the architectural details hidden away from user space.

A last note is that I didn't create this from scratch, instead I extended
KVM_MEM_READONLY implementation to also allow R/O per page instead
R/O per whole slot which is already done in kernel space.

^ permalink raw reply

* [PATCH net-next 9/9] vhost_net: batch update used ring for datacopy TX
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Like commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
this patches implements batch used ring update for datacopy TX
(zerocopy has already done some kind of batching).

Testpmd transmission from guest to host (XDP_DROP on tap) shows 25.8%
improvement (from ~3.1Mpps to ~3.9Mpps) on Broadwell i7-5600U CPU @
2.60GHz machine. Netperf TCP tests does not show obvious differences.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2fd2f0e3..367d802 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -428,16 +428,31 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
+static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct vhost_dev *dev = vq->dev;
+
+	if (!nvq->done_idx)
+		return;
+
+	vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
+	nvq->done_idx = 0;
+}
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
-				    struct vhost_virtqueue *vq,
+				    struct vhost_net_virtqueue *nvq,
 				    unsigned int *out_num, unsigned int *in_num,
 				    bool *busyloop_intr)
 {
+	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
+		if (!vhost_sock_zcopy(vq->private_data))
+			vhost_net_signal_used(nvq);
 		preempt_disable();
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(endtime)) {
@@ -493,7 +508,8 @@ static int get_tx_bufs(struct vhost_net *net,
 	struct vhost_virtqueue *vq = &nvq->vq;
 	int ret;
 
-	ret = vhost_net_tx_get_vq_desc(net, vq, out, in, busyloop_intr);
+	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, busyloop_intr);
+
 	if (ret < 0 || ret == vq->num)
 		return ret;
 
@@ -557,6 +573,9 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 			break;
 		}
 
+		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
+		vq->heads[nvq->done_idx].len = 0;
+
 		total_len += len;
 		if (tx_can_batch(vq, total_len))
 			msg.msg_flags |= MSG_MORE;
@@ -573,12 +592,15 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (++nvq->done_idx >= VHOST_NET_BATCH)
+			vhost_net_signal_used(nvq);
 		if (vhost_exceeds_weight(++sent_pkts, total_len)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+
+	vhost_net_signal_used(nvq);
 }
 
 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
@@ -741,18 +763,6 @@ static int sk_has_rx_data(struct sock *sk)
 	return skb_queue_empty(&sk->sk_receive_queue);
 }
 
-static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
-{
-	struct vhost_virtqueue *vq = &nvq->vq;
-	struct vhost_dev *dev = vq->dev;
-
-	if (!nvq->done_idx)
-		return;
-
-	vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
-	nvq->done_idx = 0;
-}
-
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 				      bool *busyloop_intr)
 {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 8/9] vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

A more generic name which could be used for TX as well.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53d305b..2fd2f0e3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -94,7 +94,7 @@ struct vhost_net_ubuf_ref {
 	struct vhost_virtqueue *vq;
 };
 
-#define VHOST_RX_BATCH 64
+#define VHOST_NET_BATCH 64
 struct vhost_net_buf {
 	void **queue;
 	int tail;
@@ -168,7 +168,7 @@ static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
 
 	rxq->head = 0;
 	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_RX_BATCH);
+					      VHOST_NET_BATCH);
 	return rxq->tail;
 }
 
@@ -1007,7 +1007,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		nvq->done_idx += headcount;
-		if (nvq->done_idx > VHOST_RX_BATCH)
+		if (nvq->done_idx > VHOST_NET_BATCH)
 			vhost_net_signal_used(nvq);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
@@ -1075,7 +1075,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return -ENOMEM;
 	}
 
-	queue = kmalloc_array(VHOST_RX_BATCH, sizeof(void *),
+	queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
 			      GFP_KERNEL);
 	if (!queue) {
 		kfree(vqs);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 7/9] vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Rename for reusing this for TX.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9cef0b2..53d305b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -741,7 +741,7 @@ static int sk_has_rx_data(struct sock *sk)
 	return skb_queue_empty(&sk->sk_receive_queue);
 }
 
-static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
+static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
 {
 	struct vhost_virtqueue *vq = &nvq->vq;
 	struct vhost_dev *dev = vq->dev;
@@ -765,7 +765,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 
 	if (!len && tvq->busyloop_timeout) {
 		/* Flush batched heads first */
-		vhost_rx_signal_used(rnvq);
+		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
 		mutex_lock_nested(&tvq->mutex, 1);
 		vhost_disable_notify(&net->dev, tvq);
@@ -1008,7 +1008,7 @@ static void handle_rx(struct vhost_net *net)
 		}
 		nvq->done_idx += headcount;
 		if (nvq->done_idx > VHOST_RX_BATCH)
-			vhost_rx_signal_used(nvq);
+			vhost_net_signal_used(nvq);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
 		total_len += vhost_len;
@@ -1022,7 +1022,7 @@ static void handle_rx(struct vhost_net *net)
 	else
 		vhost_net_enable_vq(net, vq);
 out:
-	vhost_rx_signal_used(nvq);
+	vhost_net_signal_used(nvq);
 	mutex_unlock(&vq->mutex);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 6/9] vhost_net: split out datacopy logic
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Instead of mixing zerocopy and datacopy logics, this patch tries to
split datacopy logic out. This results for a more compact code and
ad-hoc optimization could be done on top more easily.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 110 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 90 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f59b615..9cef0b2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -520,9 +520,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-/* Expects to be always run from workqueue - which acts as
- * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 {
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
@@ -537,30 +535,76 @@ static void handle_tx(struct vhost_net *net)
 	};
 	size_t len, total_len = 0;
 	int err;
-	struct socket *sock;
-	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
-	bool zcopy, zcopy_used;
 	int sent_pkts = 0;
 
-	mutex_lock(&vq->mutex);
-	sock = vq->private_data;
-	if (!sock)
-		goto out;
+	for (;;) {
+		bool busyloop_intr = false;
 
-	if (!vq_iotlb_prefetch(vq))
-		goto out;
+		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
+				   &busyloop_intr);
+		/* On error, stop handling until the next kick. */
+		if (unlikely(head < 0))
+			break;
+		/* Nothing new?  Wait for eventfd to tell us they refilled. */
+		if (head == vq->num) {
+			if (unlikely(busyloop_intr)) {
+				vhost_poll_queue(&vq->poll);
+			} else if (unlikely(vhost_enable_notify(&net->dev,
+								vq))) {
+				vhost_disable_notify(&net->dev, vq);
+				continue;
+			}
+			break;
+		}
 
-	vhost_disable_notify(&net->dev, vq);
-	vhost_net_disable_vq(net, vq);
+		total_len += len;
+		if (tx_can_batch(vq, total_len))
+			msg.msg_flags |= MSG_MORE;
+		else
+			msg.msg_flags &= ~MSG_MORE;
+
+		/* TODO: Check specific error and bomb out unless ENOBUFS? */
+		err = sock->ops->sendmsg(sock, &msg, len);
+		if (unlikely(err < 0)) {
+			vhost_discard_vq_desc(vq, 1);
+			vhost_net_enable_vq(net, vq);
+			break;
+		}
+		if (err != len)
+			pr_debug("Truncated TX packet: len %d != %zd\n",
+				 err, len);
+		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (vhost_exceeds_weight(++sent_pkts, total_len)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
 
-	zcopy = nvq->ubufs;
+static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+	unsigned out, in;
+	int head;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t len, total_len = 0;
+	int err;
+	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+	bool zcopy_used;
+	int sent_pkts = 0;
 
 	for (;;) {
 		bool busyloop_intr;
 
 		/* Release DMAs done buffers first */
-		if (zcopy)
-			vhost_zerocopy_signal_used(net, vq);
+		vhost_zerocopy_signal_used(net, vq);
 
 		busyloop_intr = false;
 		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
@@ -579,9 +623,9 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 
-		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
-				   && !vhost_exceeds_maxpend(net)
-				   && vhost_net_tx_select_zcopy(net);
+		zcopy_used = len >= VHOST_GOODCOPY_LEN
+			     && !vhost_exceeds_maxpend(net)
+			     && vhost_net_tx_select_zcopy(net);
 
 		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 		if (zcopy_used) {
@@ -636,6 +680,32 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 	}
+}
+
+/* Expects to be always run from workqueue - which acts as
+ * read-size critical section for our kind of RCU. */
+static void handle_tx(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct socket *sock;
+
+	mutex_lock(&vq->mutex);
+	sock = vq->private_data;
+	if (!sock)
+		goto out;
+
+	if (!vq_iotlb_prefetch(vq))
+		goto out;
+
+	vhost_disable_notify(&net->dev, vq);
+	vhost_net_disable_vq(net, vq);
+
+	if (vhost_sock_zcopy(sock))
+		handle_tx_zerocopy(net, sock);
+	else
+		handle_tx_copy(net, sock);
+
 out:
 	mutex_unlock(&vq->mutex);
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 5/9] vhost_net: introduce tx_can_batch()
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Introduce tx_can_batch() to determine whether TX could be
batched. This will help to reduce the code duplication in the future.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a014ca0..f59b615 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -514,6 +514,12 @@ static int get_tx_bufs(struct vhost_net *net,
 	return ret;
 }
 
+static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
+{
+	return total_len < VHOST_NET_WEIGHT &&
+	       !vhost_vq_avail_empty(vq->dev, vq);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -598,8 +604,7 @@ static void handle_tx(struct vhost_net *net)
 			ubufs = NULL;
 		}
 		total_len += len;
-		if (total_len < VHOST_NET_WEIGHT &&
-		    !vhost_vq_avail_empty(&net->dev, vq) &&
+		if (tx_can_batch(vq, total_len) &&
 		    likely(!vhost_exceeds_maxpend(net))) {
 			msg.msg_flags |= MSG_MORE;
 		} else {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/9] vhost_net: introduce get_tx_bufs()
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Factor out logic of getting tx buffer and iov iter
initialization. This will be used for reducing codes duplication in
the future.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 49 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9e1674..a014ca0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -484,6 +484,36 @@ static bool vhost_exceeds_weight(int pkts, int total_len)
 	       pkts >= VHOST_NET_PKT_WEIGHT;
 }
 
+static int get_tx_bufs(struct vhost_net *net,
+		       struct vhost_net_virtqueue *nvq,
+		       struct msghdr *msg,
+		       unsigned int *out, unsigned int *in,
+		       size_t *len, bool *busyloop_intr)
+{
+	struct vhost_virtqueue *vq = &nvq->vq;
+	int ret;
+
+	ret = vhost_net_tx_get_vq_desc(net, vq, out, in, busyloop_intr);
+	if (ret < 0 || ret == vq->num)
+		return ret;
+
+	if (*in) {
+		vq_err(vq, "Unexpected descriptor format for TX: out %d, int %d\n",
+			*out, *in);
+		return -EFAULT;
+	}
+
+	/* Sanity check */
+	*len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
+	if (*len == 0) {
+		vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
+			*len, nvq->vhost_hlen);
+		return -EFAULT;
+	}
+
+	return ret;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -501,7 +531,6 @@ static void handle_tx(struct vhost_net *net)
 	};
 	size_t len, total_len = 0;
 	int err;
-	size_t hdr_size;
 	struct socket *sock;
 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
 	bool zcopy, zcopy_used;
@@ -518,7 +547,6 @@ static void handle_tx(struct vhost_net *net)
 	vhost_disable_notify(&net->dev, vq);
 	vhost_net_disable_vq(net, vq);
 
-	hdr_size = nvq->vhost_hlen;
 	zcopy = nvq->ubufs;
 
 	for (;;) {
@@ -529,8 +557,8 @@ static void handle_tx(struct vhost_net *net)
 			vhost_zerocopy_signal_used(net, vq);
 
 		busyloop_intr = false;
-		head = vhost_net_tx_get_vq_desc(net, vq, &out, &in,
-						&busyloop_intr);
+		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
+				   &busyloop_intr);
 		/* On error, stop handling until the next kick. */
 		if (unlikely(head < 0))
 			break;
@@ -544,19 +572,6 @@ static void handle_tx(struct vhost_net *net)
 			}
 			break;
 		}
-		if (in) {
-			vq_err(vq, "Unexpected descriptor format for TX: "
-			       "out %d, int %d\n", out, in);
-			break;
-		}
-
-		/* Sanity check */
-		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
-		if (!len) {
-			vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
-			len, hdr_size);
-			break;
-		}
 
 		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
 				   && !vhost_exceeds_maxpend(net)
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/9] vhost_net: introduce vhost_exceeds_weight()
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cac28fd..b9e1674 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -478,6 +478,12 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
 	return iov_iter_count(iter);
 }
 
+static bool vhost_exceeds_weight(int pkts, int total_len)
+{
+	return total_len >= VHOST_NET_WEIGHT ||
+	       pkts >= VHOST_NET_PKT_WEIGHT;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -576,7 +582,6 @@ static void handle_tx(struct vhost_net *net)
 			msg.msg_control = NULL;
 			ubufs = NULL;
 		}
-
 		total_len += len;
 		if (total_len < VHOST_NET_WEIGHT &&
 		    !vhost_vq_avail_empty(&net->dev, vq) &&
@@ -606,8 +611,7 @@ static void handle_tx(struct vhost_net *net)
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		vhost_net_tx_packet(net);
-		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
+		if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
@@ -918,8 +922,7 @@ static void handle_rx(struct vhost_net *net)
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
 		total_len += vhost_len;
-		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-		    unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
+		if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
 			vhost_poll_queue(&vq->poll);
 			goto out;
 		}
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/9] vhost_net: introduce helper to initialize tx iov iter
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Introduce init_iov_iter() in order to be reused by future patch.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1a8175a..cac28fd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -466,6 +466,18 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
 	       min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
 }
 
+static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
+			    size_t hdr_size, int out)
+{
+	/* Skip header. TODO: support TSO. */
+	size_t len = iov_length(vq->iov, out);
+
+	iov_iter_init(iter, WRITE, vq->iov, out, len);
+	iov_iter_advance(iter, hdr_size);
+
+	return iov_iter_count(iter);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -531,18 +543,14 @@ static void handle_tx(struct vhost_net *net)
 			       "out %d, int %d\n", out, in);
 			break;
 		}
-		/* Skip header. TODO: support TSO. */
-		len = iov_length(vq->iov, out);
-		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
-		iov_iter_advance(&msg.msg_iter, hdr_size);
+
 		/* Sanity check */
-		if (!msg_data_left(&msg)) {
-			vq_err(vq, "Unexpected header len for TX: "
-			       "%zd expected %zd\n",
-			       len, hdr_size);
+		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+		if (!len) {
+			vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
+			len, hdr_size);
 			break;
 		}
-		len = msg_data_left(&msg);
 
 		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
 				   && !vhost_exceeds_maxpend(net)
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 1/9] vhost_net: drop unnecessary parameter
From: Jason Wang @ 2018-07-20  0:15 UTC (permalink / raw)
  To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b224036..1a8175a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -430,7 +430,6 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_virtqueue *vq,
-				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num,
 				    bool *busyloop_intr)
 {
@@ -512,9 +511,8 @@ static void handle_tx(struct vhost_net *net)
 			vhost_zerocopy_signal_used(net, vq);
 
 		busyloop_intr = false;
-		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
-						ARRAY_SIZE(vq->iov),
-						&out, &in, &busyloop_intr);
+		head = vhost_net_tx_get_vq_desc(net, vq, &out, &in,
+						&busyloop_intr);
 		/* On error, stop handling until the next kick. */
 		if (unlikely(head < 0))
 			break;
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox