Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/virtio: use new drm_crtc_send_vblank_event()
From: Gustavo Padovan @ 2016-03-21 19:23 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, Gustavo Padovan, open list,
	open list:VIRTIO GPU DRIVER, open list:VIRTIO GPU DRIVER

From: Gustavo Padovan <gustavo.padovan@collabora.co.uk>

Simplify code by using the new vblank crtc helpers.

Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
---
 drivers/gpu/drm/virtio/virtgpu_display.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 429aa31..b70bb8b 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -163,7 +163,7 @@ static int virtio_gpu_page_flip(struct drm_crtc *crtc,
 
 	if (event) {
 		spin_lock_irqsave(&crtc->dev->event_lock, irqflags);
-		drm_send_vblank_event(crtc->dev, -1, event);
+		drm_crtc_send_vblank_event(crtc, event);
 		spin_unlock_irqrestore(&crtc->dev->event_lock, irqflags);
 	}
 
-- 
2.5.0

^ permalink raw reply related

* [PATCH 2/2] drm/virtio: send vblank event on plane atomic update
From: Gustavo Padovan @ 2016-03-21 19:23 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, Gustavo Padovan, open list,
	open list:VIRTIO GPU DRIVER, open list:VIRTIO GPU DRIVER
In-Reply-To: <1458588187-26002-1-git-send-email-gustavo@padovan.org>

From: Gustavo Padovan <gustavo.padovan@collabora.co.uk>

virtio_gpu was failing to send vblank events when using the atomic IOCTL
with the DRM_MODE_PAGE_FLIP_EVENT flag set. This patch fixes each and
enables atomic pageflips updates.

Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
---
 drivers/gpu/drm/virtio/virtgpu_plane.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c b/drivers/gpu/drm/virtio/virtgpu_plane.c
index 70b44a2..20260ad 100644
--- a/drivers/gpu/drm/virtio/virtgpu_plane.c
+++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
@@ -63,9 +63,11 @@ static void virtio_gpu_plane_atomic_update(struct drm_plane *plane,
 {
 	struct drm_device *dev = plane->dev;
 	struct virtio_gpu_device *vgdev = dev->dev_private;
-	struct virtio_gpu_output *output = drm_crtc_to_virtio_gpu_output(plane->crtc);
+	struct drm_crtc *crtc = plane->crtc;
+	struct virtio_gpu_output *output = drm_crtc_to_virtio_gpu_output(crtc);
 	struct virtio_gpu_framebuffer *vgfb;
 	struct virtio_gpu_object *bo;
+	unsigned long flags;
 	uint32_t handle;
 
 	if (plane->state->fb) {
@@ -96,6 +98,11 @@ static void virtio_gpu_plane_atomic_update(struct drm_plane *plane,
 				      plane->state->crtc_y,
 				      plane->state->crtc_w,
 				      plane->state->crtc_h);
+
+	spin_lock_irqsave(&crtc->dev->event_lock, flags);
+	if (crtc->state->event)
+		drm_crtc_send_vblank_event(crtc, crtc->state->event);
+	spin_unlock_irqrestore(&crtc->dev->event_lock, flags);
 }
 
 
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH v2 14/18] mm/balloon: use general movable page feature into balloon
From: Minchan Kim @ 2016-03-22  2:19 UTC (permalink / raw)
  To: kbuild test robot
  Cc: aquini, rknize, virtualization, bfields, linux-mm, Sangseok Lee,
	jlayton, koct9i, YiPing Xu, Hugh Dickins, Gioh Kim, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Al Viro, Andrew Morton,
	Chan Gyun Jeong, linux-kernel, Sergey Senozhatsky, Gioh Kim,
	kbuild-all, Joonsoo Kim
In-Reply-To: <201603211608.zNLWtmQ0%fengguang.wu@intel.com>

On Mon, Mar 21, 2016 at 04:29:55PM +0800, kbuild test robot wrote:
> Hi Minchan,
> 
> [auto build test ERROR on next-20160318]
> [cannot apply to v4.5-rc7 v4.5-rc6 v4.5-rc5 v4.5]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Minchan-Kim/Support-non-lru-page-migration/20160321-143339
> config: x86_64-randconfig-x000-201612 (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All error/warnings (new ones prefixed by >>):
> 
>    drivers/virtio/virtio_balloon.c: In function 'virtballoon_probe':
> >> drivers/virtio/virtio_balloon.c:578:15: error: 'balloon_mnt' undeclared (first use in this function)
>      kern_unmount(balloon_mnt);
>                   ^
>    drivers/virtio/virtio_balloon.c:578:15: note: each undeclared identifier is reported only once for each function it appears in
> >> drivers/virtio/virtio_balloon.c:579:1: warning: label 'out_free_vb' defined but not used [-Wunused-label]
>     out_free_vb:
>     ^
> 
> vim +/balloon_mnt +578 drivers/virtio/virtio_balloon.c
> 
>    572	
>    573	out_oom_notify:
>    574		vdev->config->del_vqs(vdev);
>    575	out_unmount:
>    576		if (vb->vb_dev_info.inode)
>    577			iput(vb->vb_dev_info.inode);
>  > 578		kern_unmount(balloon_mnt);
>  > 579	out_free_vb:
>    580		kfree(vb);
>    581	out:
>    582		return err;
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Thanks, kbuild.
Fixed.

From 7006a7ee62bb09273f96d8cb45c32e42453ab931 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Thu, 3 Mar 2016 14:28:45 +0900
Subject: [PATCH] mm/balloon: use general movable page feature into balloon

Now, VM has a feature to migrate non-lru movable pages so
balloon doesn't need custom migration hooks in migrate.c
and compact.c. Instead, this patch implements page->mapping
->{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general
migration functions and make balloon compaction simple.

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Gioh Kim <gurugio@hanmail.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    |  53 ++++++++++++++++---
 include/linux/balloon_compaction.h |  47 ++++-------------
 include/linux/page-flags.h         |  52 +++++++++++--------
 include/uapi/linux/magic.h         |   1 +
 mm/balloon_compaction.c            | 101 ++++++++-----------------------------
 mm/compaction.c                    |   7 ---
 mm/migrate.c                       |  22 ++------
 mm/vmscan.c                        |   2 +-
 8 files changed, 116 insertions(+), 169 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..0c16192d2684 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -482,10 +487,29 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	mutex_unlock(&vb->balloon_lock);
 
+	ClearPageIsolated(page);
 	put_page(page); /* balloon reference */
 
 	return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,10 +539,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
-
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -527,13 +547,32 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
 
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -567,6 +606,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..43a858545844 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include <linux/migrate.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -123,8 +99,8 @@ static inline bool isolated_balloon_page(struct page *page)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
+	page->mapping = balloon->inode->i_mapping;
 	__SetPageBalloon(page);
-	SetPagePrivate(page);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -140,11 +116,10 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
+	page->mapping = NULL;
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 3885064641c4..4853e0487175 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -599,50 +599,58 @@ static inline void __ClearPageBuddy(struct page *page)
 
 extern bool is_free_buddy_page(struct page *page);
 
-#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)
+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
+#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
 
-static inline int PageBalloon(struct page *page)
+static inline int PageMovable(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
+	return (test_bit(PG_movable, &(page)->flags) &&
+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
 }
 
-static inline void __SetPageBalloon(struct page *page)
+/* Caller should hold a PG_lock */
+static inline void __SetPageMovable(struct page *page)
 {
-	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
+	__set_bit(PG_movable, &page->flags);
+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
 }
 
-static inline void __ClearPageBalloon(struct page *page)
+static inline void __ClearPageMovable(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageBalloon(page), page);
 	atomic_set(&page->_mapcount, -1);
+	__clear_bit(PG_movable, &(page)->flags);
 }
 
-#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
+PAGEFLAG(Isolated, isolated, PF_ANY);
 
-static inline int PageMovable(struct page *page)
+static inline int PageBalloon(struct page *page)
 {
-	return ((test_bit(PG_movable, &(page)->flags) &&
-		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
-		|| PageBalloon(page));
+	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
+		&& PagePrivate2(page);
 }
 
-/*
- * Caller should hold a PG_lock */
-static inline void __SetPageMovable(struct page *page)
+static inline void __SetPageBalloon(struct page *page)
 {
-	__set_bit(PG_movable, &page->flags);
-	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
+	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__SetPageMovable(page);
+#else
+	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
+#endif
+	SetPagePrivate2(page);
 }
 
-static inline void __ClearPageMovable(struct page *page)
+static inline void __ClearPageBalloon(struct page *page)
 {
+	VM_BUG_ON_PAGE(!PageBalloon(page), page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__ClearPageMovable(page);
+#else
 	atomic_set(&page->_mapcount, -1);
-	__clear_bit(PG_movable, &(page)->flags);
+#endif
+	ClearPagePrivate2(page);
 }
 
-PAGEFLAG(Isolated, isolated, PF_ANY);
-
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181ad73d5..e1fbe72c39c0 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -78,5 +78,6 @@
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..1fbc7fb387bb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,53 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+	SetPageIsolated(page);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+/* putback_lru_page() counterpart for a ballooned page */
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
+	ClearPageIsolated(page);
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
-
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
-
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
-
-	return rc;
+	return balloon->migratepage(balloon, newpage, page, mode);
 }
+
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index 7557aedddaee..e336c620fd7b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -708,13 +708,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			if (unlikely(PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
diff --git a/mm/migrate.c b/mm/migrate.c
index ab87ef45a3b9..3234c14ed1cd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -147,8 +147,8 @@ void putback_movable_page(struct page *page)
  * from where they were once taken off for compaction/migration.
  *
  * This function shall be used whenever the isolated pageset has been
- * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
- * and isolate_huge_page().
+ * built from lru, movable, hugetlbfs page.
+ * See isolate_migratepages_range() and isolate_huge_page().
  */
 void putback_movable_pages(struct list_head *l)
 {
@@ -163,9 +163,7 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
-			balloon_page_putback(page);
-		else if (unlikely(PageIsolated(page)))
+		if (unlikely(PageIsolated(page)))
 			putback_movable_page(page);
 		else
 			putback_lru_page(page);
@@ -959,18 +957,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -1015,7 +1001,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 out:
 	/* If migration is successful, move newpage to right list */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c72032dbe8db..e5dfa0cf6fdc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !PageIsolated(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] zsmalloc: fix semicolon.cocci warnings
From: Minchan Kim @ 2016-03-22  2:22 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, kbuild-all,
	Sangseok Lee, Joonsoo Kim, Andrew Morton, koct9i, jlayton,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160321094825.GA50080@roam.intel.com>

On Mon, Mar 21, 2016 at 05:48:25PM +0800, kbuild test robot wrote:
> mm/zsmalloc.c:1103:2-3: Unneeded semicolon
> 
> 
>  Remove unneeded semicolon.
> 
> Generated by: scripts/coccinelle/misc/semicolon.cocci
> 
> CC: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
> 
>  zsmalloc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1100,7 +1100,7 @@ void unlock_zspage(struct page *first_pa
>  		VM_BUG_ON_PAGE(!PageLocked(cursor), cursor);
>  		if (cursor != locked_page)
>  			unlock_page(cursor);
> -	};
> +	}
>  }
>  
>  static void free_zspage(struct zs_pool *pool, struct page *first_page)

Thanks.
Fixed.


From bb46f8265b55228f31b8096bd1c13dd6e6ee1bc4 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Wed, 9 Mar 2016 09:37:57 +0900
Subject: [PATCH] zsmalloc: migrate tail pages in zspage

This patch enables tail page migration of zspage.

In this point, I tested zsmalloc regression with micro-benchmark
which does zs_malloc/map/unmap/zs_free for all size class
in every CPU(my system is 12) during 20 sec.

It shows 1% regression which is really small when we consider
the benefit of this feature and realworkload overhead(i.e.,
most overhead comes from compression).

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 115 insertions(+), 16 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 9b4b03d8f993..3f1d488633e1 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -551,6 +551,19 @@ static void set_zspage_mapping(struct page *first_page,
 	m->class = class_idx;
 }
 
+static bool check_isolated_page(struct page *first_page)
+{
+	struct page *cursor;
+
+	for (cursor = first_page; cursor != NULL; cursor =
+					get_next_page(cursor)) {
+		if (PageIsolated(cursor))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * zsmalloc divides the pool into various size classes where each
  * class maintains a list of zspages where each zspage is divided
@@ -1052,6 +1065,44 @@ void lock_zspage(struct page *first_page)
 	} while ((cursor = get_next_page(cursor)) != NULL);
 }
 
+int trylock_zspage(struct page *first_page, struct page *locked_page)
+{
+	struct page *cursor, *fail;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	for (cursor = first_page; cursor != NULL; cursor =
+			get_next_page(cursor)) {
+		if (cursor != locked_page) {
+			if (!trylock_page(cursor)) {
+				fail = cursor;
+				goto unlock;
+			}
+		}
+	}
+
+	return 1;
+unlock:
+	for (cursor = first_page; cursor != fail; cursor =
+			get_next_page(cursor)) {
+		if (cursor != locked_page)
+			unlock_page(cursor);
+	}
+
+	return 0;
+}
+
+void unlock_zspage(struct page *first_page, struct page *locked_page)
+{
+	struct page *cursor = first_page;
+
+	for (; cursor != NULL; cursor = get_next_page(cursor)) {
+		VM_BUG_ON_PAGE(!PageLocked(cursor), cursor);
+		if (cursor != locked_page)
+			unlock_page(cursor);
+	}
+}
+
 static void free_zspage(struct zs_pool *pool, struct page *first_page)
 {
 	struct page *nextp, *tmp;
@@ -1090,16 +1141,17 @@ static void init_zspage(struct size_class *class, struct page *first_page,
 	first_page->freelist = NULL;
 	INIT_LIST_HEAD(&first_page->lru);
 	set_zspage_inuse(first_page, 0);
-	BUG_ON(!trylock_page(first_page));
-	first_page->mapping = mapping;
-	__SetPageMovable(first_page);
-	unlock_page(first_page);
 
 	while (page) {
 		struct page *next_page;
 		struct link_free *link;
 		void *vaddr;
 
+		BUG_ON(!trylock_page(page));
+		page->mapping = mapping;
+		__SetPageMovable(page);
+		unlock_page(page);
+
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
@@ -1850,6 +1902,7 @@ static enum fullness_group putback_zspage(struct size_class *class,
 
 	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
 	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
+	VM_BUG_ON_PAGE(check_isolated_page(first_page), first_page);
 
 	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
@@ -1956,6 +2009,12 @@ static struct page *isolate_source_page(struct size_class *class)
 		if (!page)
 			continue;
 
+		/* To prevent race between object and page migration */
+		if (!trylock_zspage(page, NULL)) {
+			page = NULL;
+			continue;
+		}
+
 		remove_zspage(class, i, page);
 
 		inuse = get_zspage_inuse(page);
@@ -1964,6 +2023,7 @@ static struct page *isolate_source_page(struct size_class *class)
 		if (inuse != freezed) {
 			unfreeze_zspage(class, page, freezed);
 			putback_zspage(class, page);
+			unlock_zspage(page, NULL);
 			page = NULL;
 			continue;
 		}
@@ -1995,6 +2055,12 @@ static struct page *isolate_target_page(struct size_class *class)
 		if (!page)
 			continue;
 
+		/* To prevent race between object and page migration */
+		if (!trylock_zspage(page, NULL)) {
+			page = NULL;
+			continue;
+		}
+
 		remove_zspage(class, i, page);
 
 		inuse = get_zspage_inuse(page);
@@ -2003,6 +2069,7 @@ static struct page *isolate_target_page(struct size_class *class)
 		if (inuse != freezed) {
 			unfreeze_zspage(class, page, freezed);
 			putback_zspage(class, page);
+			unlock_zspage(page, NULL);
 			page = NULL;
 			continue;
 		}
@@ -2076,11 +2143,13 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			putback_zspage(class, dst_page);
 			unfreeze_zspage(class, dst_page,
 				class->objs_per_zspage);
+			unlock_zspage(dst_page, NULL);
 			spin_unlock(&class->lock);
 			dst_page = NULL;
 		}
 
 		if (zspage_empty(class, src_page)) {
+			unlock_zspage(src_page, NULL);
 			free_zspage(pool, src_page);
 			spin_lock(&class->lock);
 			zs_stat_dec(class, OBJ_ALLOCATED,
@@ -2103,12 +2172,14 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 		putback_zspage(class, src_page);
 		unfreeze_zspage(class, src_page,
 				class->objs_per_zspage);
+		unlock_zspage(src_page, NULL);
 	}
 
 	if (dst_page) {
 		putback_zspage(class, dst_page);
 		unfreeze_zspage(class, dst_page,
 				class->objs_per_zspage);
+		unlock_zspage(dst_page, NULL);
 	}
 
 	spin_unlock(&class->lock);
@@ -2211,10 +2282,11 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageIsolated(page), page);
 	/*
-	 * In this implementation, it allows only first page migration.
+	 * first_page will not be destroyed by PG_lock of @page but it could
+	 * be migrated out. For prohibiting it, zs_page_migrate calls
+	 * trylock_zspage so it closes the race.
 	 */
-	VM_BUG_ON_PAGE(!is_first_page(page), page);
-	first_page = page;
+	first_page = get_first_page(page);
 
 	/*
 	 * Without class lock, fullness is meaningless while constant
@@ -2228,9 +2300,18 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 	if (!spin_trylock(&class->lock))
 		return false;
 
+	if (check_isolated_page(first_page))
+		goto skip_isolate;
+
+	/*
+	 * If this is first time isolation for zspage, isolate zspage from
+	 * size_class to prevent further allocations from the zspage.
+	 */
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	remove_zspage(class, fullness, first_page);
 	SetZsPageIsolate(first_page);
+
+skip_isolate:
 	SetPageIsolated(page);
 	spin_unlock(&class->lock);
 
@@ -2253,7 +2334,7 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	first_page = page;
+	first_page = get_first_page(page);
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	pool = page->mapping->private_data;
 	class = pool->size_class[class_idx];
@@ -2268,6 +2349,13 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	if (get_zspage_inuse(first_page) == 0)
 		goto out_class_unlock;
 
+	/*
+	 * It prevents first_page migration during tail page opeartion for
+	 * get_first_page's stability.
+	 */
+	if (!trylock_zspage(first_page, page))
+		goto out_class_unlock;
+
 	freezed = freeze_zspage(class, first_page);
 	if (freezed != get_zspage_inuse(first_page))
 		goto out_unfreeze;
@@ -2306,21 +2394,26 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	kunmap_atomic(addr);
 
 	replace_sub_page(class, first_page, newpage, page);
-	first_page = newpage;
+	first_page = get_first_page(newpage);
 	get_page(newpage);
 	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
 			ZS_EMPTY, first_page);
-	ClearZsPageIsolate(first_page);
-	putback_zspage(class, first_page);
+	if (!check_isolated_page(first_page)) {
+		INIT_LIST_HEAD(&first_page->lru);
+		ClearZsPageIsolate(first_page);
+		putback_zspage(class, first_page);
+	}
+
 
 	/* Migration complete. Free old page */
 	reset_page(page);
 	ClearPageIsolated(page);
 	put_page(page);
 	ret = MIGRATEPAGE_SUCCESS;
-
+	page = newpage;
 out_unfreeze:
 	unfreeze_zspage(class, first_page, freezed);
+	unlock_zspage(first_page, page);
 out_class_unlock:
 	spin_unlock(&class->lock);
 
@@ -2338,7 +2431,7 @@ void zs_page_putback(struct page *page)
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	first_page = page;
+	first_page = get_first_page(page);
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	pool = page->mapping->private_data;
 	class = pool->size_class[class_idx];
@@ -2348,11 +2441,17 @@ void zs_page_putback(struct page *page)
 	 * in zs_free will wait the page lock of @page without
 	 * destroying of zspage.
 	 */
-	INIT_LIST_HEAD(&first_page->lru);
 	spin_lock(&class->lock);
 	ClearPageIsolated(page);
-	ClearZsPageIsolate(first_page);
-	putback_zspage(class, first_page);
+	/*
+	 * putback zspage to right list if this is last isolated page
+	 * putback in the zspage.
+	 */
+	if (!check_isolated_page(first_page)) {
+		INIT_LIST_HEAD(&first_page->lru);
+		ClearZsPageIsolate(first_page);
+		putback_zspage(class, first_page);
+	}
 	spin_unlock(&class->lock);
 }
 
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v2 13/18] mm/compaction: support non-lru movable page migration
From: Joonsoo Kim @ 2016-03-22  5:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, dri-devel, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, koct9i,
	Sangseok Lee, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1458541867-27380-14-git-send-email-minchan@kernel.org>

On Mon, Mar 21, 2016 at 03:31:02PM +0900, Minchan Kim wrote:
> We have allowed migration for only LRU pages until now and it was
> enough to make high-order pages. But recently, embedded system(e.g.,
> webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> so we have seen several reports about troubles of small high-order
> allocation. For fixing the problem, there were several efforts
> (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> reserved memory, vmalloc and so on) but if there are lots of
> non-movable pages in system, their solutions are void in the long run.
> 
> So, this patch is to support facility to change non-movable pages
> with movable. For the feature, this patch introduces functions related
> to migration to address_space_operations as well as some page flags.
> 
> Basically, this patch supports two page-flags and two functions related
> to page migration. The flag and page->mapping stability are protected
> by PG_lock.
> 
> 	PG_movable
> 	PG_isolated
> 
> 	bool (*isolate_page) (struct page *, isolate_mode_t);
> 	void (*putback_page) (struct page *);
> 
> Duty of subsystem want to make their pages as migratable are
> as follows:
> 
> 1. It should register address_space to page->mapping then mark
> the page as PG_movable via __SetPageMovable.
> 
> 2. It should mark the page as PG_isolated via SetPageIsolated
> if isolation is sucessful and return true.
> 
> 3. If migration is successful, it should clear PG_isolated and
> PG_movable of the page for free preparation then release the
> reference of the page to free.
> 
> 4. If migration fails, putback function of subsystem should
> clear PG_isolated via ClearPageIsolated.

I think that this feature needs a separate document to describe
requirement of each step in more detail. For example, #1 can be
possible without holding a lock? I'm not sure because you lock
the page when implementing zsmalloc page migration in 15th patch.

#3 also need more explanation. Before release, we need to
unregister address_space. I guess that it needs to be done
in migratepage() but there is no explanation.

> 
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  Documentation/filesystems/Locking      |   4 +
>  Documentation/filesystems/vfs.txt      |   5 ++
>  fs/proc/page.c                         |   3 +
>  include/linux/fs.h                     |   2 +
>  include/linux/migrate.h                |   2 +
>  include/linux/page-flags.h             |  29 ++++++++
>  include/uapi/linux/kernel-page-flags.h |   1 +
>  mm/compaction.c                        |  14 +++-
>  mm/migrate.c                           | 132 +++++++++++++++++++++++++++++----
>  9 files changed, 177 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9bfdcb3..0bb79560abb3 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -195,7 +195,9 @@ unlocks and drops the reference.
>  	int (*releasepage) (struct page *, int);
>  	void (*freepage)(struct page *);
>  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> +	void (*putback_page) (struct page *);
>  	int (*launder_page)(struct page *);
>  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
>  	int (*error_remove_page)(struct address_space *, struct page *);
> @@ -219,7 +221,9 @@ invalidatepage:		yes
>  releasepage:		yes
>  freepage:		yes
>  direct_IO:
> +isolate_page:		yes
>  migratepage:		yes (both)
> +putback_page:		yes
>  launder_page:		yes
>  is_partially_uptodate:	yes
>  error_remove_page:	yes
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index b02a7d598258..4c1b6c3b4bc8 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -592,9 +592,14 @@ struct address_space_operations {
>  	int (*releasepage) (struct page *, int);
>  	void (*freepage)(struct page *);
>  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	/* isolate a page for migration */
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>  	/* migrate the contents of a page to the specified target */
>  	int (*migratepage) (struct page *, struct page *);
> +	/* put the page back to right list */
> +	void (*putback_page) (struct page *);
>  	int (*launder_page) (struct page *);
> +
>  	int (*is_partially_uptodate) (struct page *, unsigned long,
>  					unsigned long);
>  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 712f1b9992cc..e2066e73a9b8 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
>  	if (page_is_idle(page))
>  		u |= 1 << KPF_IDLE;
>  
> +	if (PageMovable(page))
> +		u |= 1 << KPF_MOVABLE;
> +
>  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
>  
>  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 14a97194b34b..b7ef2e41fa4a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -401,6 +401,8 @@ struct address_space_operations {
>  	 */
>  	int (*migratepage) (struct address_space *,
>  			struct page *, struct page *, enum migrate_mode);
> +	bool (*isolate_page)(struct page *, isolate_mode_t);
> +	void (*putback_page)(struct page *);
>  	int (*launder_page) (struct page *);
>  	int (*is_partially_uptodate) (struct page *, unsigned long,
>  					unsigned long);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9b50325e4ddf..404fbfefeb33 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
>  			struct page *, struct page *, enum migrate_mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>  		unsigned long private, enum migrate_mode mode, int reason);
> +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> +extern void putback_movable_page(struct page *page);
>  
>  extern int migrate_prep(void);
>  extern int migrate_prep_local(void);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b0c77..3885064641c4 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -129,6 +129,10 @@ enum pageflags {
>  
>  	/* Compound pages. Stored in first tail page's flags */
>  	PG_double_map = PG_private_2,
> +
> +	/* non-lru movable pages */
> +	PG_movable = PG_reclaim,
> +	PG_isolated = PG_owner_priv_1,
>  };
>  
>  #ifndef __GENERATING_BOUNDS_H
> @@ -614,6 +618,31 @@ static inline void __ClearPageBalloon(struct page *page)
>  	atomic_set(&page->_mapcount, -1);
>  }
>  
> +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> +
> +static inline int PageMovable(struct page *page)
> +{
> +	return ((test_bit(PG_movable, &(page)->flags) &&
> +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> +		|| PageBalloon(page));
> +}
> +
> +/*
> + * Caller should hold a PG_lock */
> +static inline void __SetPageMovable(struct page *page)
> +{
> +	__set_bit(PG_movable, &page->flags);
> +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> +}

I think there is no big benefit to use non-atomic version here.
PageMovable() is speculatively checked without holding a PG_lock
so some cpu can miss this flag set if we use non-atomic version.

> +
> +static inline void __ClearPageMovable(struct page *page)
> +{
> +	atomic_set(&page->_mapcount, -1);
> +	__clear_bit(PG_movable, &(page)->flags);
> +}
> +
> +PAGEFLAG(Isolated, isolated, PF_ANY);
> +
>  /*
>   * If network-based swap is enabled, sl*b must keep track of whether pages
>   * were allocated from pfmemalloc reserves.
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 5da5f8751ce7..a184fd2434fa 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -34,6 +34,7 @@
>  #define KPF_BALLOON		23
>  #define KPF_ZERO_PAGE		24
>  #define KPF_IDLE		25
> +#define KPF_MOVABLE		26
>  
>  
>  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ccf97b02b85f..7557aedddaee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  
>  		/*
>  		 * Check may be lockless but that's ok as we recheck later.
> -		 * It's possible to migrate LRU pages and balloon pages
> +		 * It's possible to migrate LRU and movable kernel pages.
>  		 * Skip any other type of page
>  		 */
>  		is_lru = PageLRU(page);
> @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  					goto isolate_success;
>  				}
>  			}
> +
> +			if (unlikely(PageMovable(page)) &&
> +					!PageIsolated(page)) {
> +				if (locked) {
> +					spin_unlock_irqrestore(&zone->lru_lock,
> +									flags);
> +					locked = false;
> +				}
> +
> +				if (isolate_movable_page(page, isolate_mode))
> +					goto isolate_success;
> +			}
>  		}
>  
>  		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b65c84267ce0..fc2842a15807 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -73,6 +73,75 @@ int migrate_prep_local(void)
>  	return 0;
>  }
>  
> +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> +{
> +	bool ret = false;
> +
> +	/*
> +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> +	 * or just got freed under us.
> +	 *
> +	 * In case we 'win' a race for a movable page being freed under us and
> +	 * raise its refcount preventing __free_pages() from doing its job
> +	 * the put_page() at the end of this block will take care of
> +	 * release this page, thus avoiding a nasty leakage.
> +	 */
> +	if (unlikely(!get_page_unless_zero(page)))
> +		goto out;

After getting the ref counter, we need to re-check PageMovable()
to ensure that we indeed handle PageMovable() type page. Without it,
the page we handle can be freed and re-allocated to someone else
that isn't related to PageMovable() before grabbing the page. Trying
trylock_page() in this case could cause a problem.

> +	/*
> +	 * As movable pages are not isolated from LRU lists, concurrent
> +	 * compaction threads can race against page migration functions
> +	 * as well as race against the releasing a page.
> +	 *
> +	 * In order to avoid having an already isolated movable page
> +	 * being (wrongly) re-isolated while it is under migration,
> +	 * or to avoid attempting to isolate pages being released,
> +	 * lets be sure we have the page lock
> +	 * before proceeding with the movable page isolation steps.
> +	 */
> +	if (unlikely(!trylock_page(page)))
> +		goto out_putpage;
> +
> +	if (!PageMovable(page) || PageIsolated(page))
> +		goto out_no_isolated;
> +
> +	ret = page->mapping->a_ops->isolate_page(page, mode);
> +	if (!ret)
> +		goto out_no_isolated;
> +
> +	WARN_ON_ONCE(!PageIsolated(page));
> +	unlock_page(page);
> +	return ret;
> +
> +out_no_isolated:
> +	unlock_page(page);
> +out_putpage:
> +	put_page(page);
> +out:
> +	return ret;
> +}
> +
> +void putback_movable_page(struct page *page)
> +{
> +	struct address_space *mapping;
> +
> +	/*
> +	 * 'lock_page()' stabilizes the page and prevents races against
> +	 * concurrent isolation threads attempting to re-isolate it.
> +	 */
> +	lock_page(page);
> +	mapping = page_mapping(page);
> +	if (mapping) {
> +		mapping->a_ops->putback_page(page);
> +		WARN_ON_ONCE(PageIsolated(page));
> +	}
> +	unlock_page(page);
> +	/* drop the extra ref count taken for movable page isolation */
> +	put_page(page);
> +}

This is complicated part for me. mapping can disappear? In this case,
who clear PageIsolated()?

> +
> +
>  /*
>   * Put previously isolated pages back onto the appropriate lists
>   * from where they were once taken off for compaction/migration.
> @@ -96,6 +165,8 @@ void putback_movable_pages(struct list_head *l)
>  				page_is_file_cache(page));
>  		if (unlikely(isolated_balloon_page(page)))
>  			balloon_page_putback(page);
> +		else if (unlikely(PageIsolated(page)))
> +			putback_movable_page(page);
>  		else
>  			putback_lru_page(page);
>  	}

I think that this will not work. You uses PG_owner_priv_1 as
PG_isolated and it is possible that some lru pages has this flag.
I guess you need to add PageMovable() check but it seems that mapping
and this flag can be cleared by others.

> @@ -592,7 +663,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>   ***********************************************************/
>  
>  /*
> - * Common logic to directly migrate a single page suitable for
> + * Common logic to directly migrate a single LRU page suitable for
>   * pages that do not use PagePrivate/PagePrivate2.
>   *
>   * Pages are locked upon entry and exit.
> @@ -755,24 +826,53 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  				enum migrate_mode mode)
>  {
>  	struct address_space *mapping;
> -	int rc;
> +	int rc = -EAGAIN;
> +	bool isolated_lru_page;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>  
>  	mapping = page_mapping(page);
> -	if (!mapping)
> -		rc = migrate_page(mapping, newpage, page, mode);
> -	else if (mapping->a_ops->migratepage)
> +	/*
> +	 * In case of non-lru page, it could be released after
> +	 * isolation step. In that case, we shouldn't try
> +	 * fallback migration which was designed for LRU pages.

So, page we try to migrate can be released during migration.
In this case, who does clear mapping and flag? And, without mapping,
how we can clear PageIsolated()?

> +	 * To identify such pages, we cannot use PageMovable
> +	 * because owner of the page can reset it. So intead,
> +	 * use PG_isolated bit.
> +	 */
> +	isolated_lru_page = !PageIsolated(page);

Ditto. PageIsolated() isn't sufficient to distinguish non-lru page.

My comment mainly points out rules about when/who clear mapping
and flag. Maybe, you need to answer just one of them. :)

Thanks.

^ permalink raw reply

* Re: [PATCH 2/2] drm/virtio: send vblank event on plane atomic update
From: Daniel Stone @ 2016-03-22 13:17 UTC (permalink / raw)
  To: Gustavo Padovan
  Cc: David Airlie, open list, open list:VIRTIO GPU DRIVER,
	open list:VIRTIO GPU DRIVER
In-Reply-To: <1458588187-26002-2-git-send-email-gustavo@padovan.org>

Hi,

On 21 March 2016 at 19:23, Gustavo Padovan <gustavo@padovan.org> wrote:
> @@ -96,6 +98,11 @@ static void virtio_gpu_plane_atomic_update(struct drm_plane *plane,
>                                       plane->state->crtc_y,
>                                       plane->state->crtc_w,
>                                       plane->state->crtc_h);
> +
> +       spin_lock_irqsave(&crtc->dev->event_lock, flags);
> +       if (crtc->state->event)
> +               drm_crtc_send_vblank_event(crtc, crtc->state->event);
> +       spin_unlock_irqrestore(&crtc->dev->event_lock, flags);

This seems like the wrong place to do it, in that it will generate one
flip event per plane, rather than one per CRTC. So this should
probably be done in the overall atomic_commit hook I think.

Also, without some kind of delay, this means that we'll generate
flip-complete events immediately, which will cause compositors like
Weston to render infinitely fast. It's probably worth looking at what
happened when this came up with Bochs - I'm not sure if we fake a 16ms
delay, or refuse to do async modesets, or what.

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH v2 13/18] mm/compaction: support non-lru movable page migration
From: Minchan Kim @ 2016-03-22 14:55 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, dri-devel, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, koct9i,
	Sangseok Lee, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160322055037.GC31955@js1304-P5Q-DELUXE>

On Tue, Mar 22, 2016 at 02:50:37PM +0900, Joonsoo Kim wrote:
> On Mon, Mar 21, 2016 at 03:31:02PM +0900, Minchan Kim wrote:
> > We have allowed migration for only LRU pages until now and it was
> > enough to make high-order pages. But recently, embedded system(e.g.,
> > webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> > so we have seen several reports about troubles of small high-order
> > allocation. For fixing the problem, there were several efforts
> > (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> > reserved memory, vmalloc and so on) but if there are lots of
> > non-movable pages in system, their solutions are void in the long run.
> > 
> > So, this patch is to support facility to change non-movable pages
> > with movable. For the feature, this patch introduces functions related
> > to migration to address_space_operations as well as some page flags.
> > 
> > Basically, this patch supports two page-flags and two functions related
> > to page migration. The flag and page->mapping stability are protected
> > by PG_lock.
> > 
> > 	PG_movable
> > 	PG_isolated
> > 
> > 	bool (*isolate_page) (struct page *, isolate_mode_t);
> > 	void (*putback_page) (struct page *);
> > 
> > Duty of subsystem want to make their pages as migratable are
> > as follows:
> > 
> > 1. It should register address_space to page->mapping then mark
> > the page as PG_movable via __SetPageMovable.
> > 
> > 2. It should mark the page as PG_isolated via SetPageIsolated
> > if isolation is sucessful and return true.
> > 
> > 3. If migration is successful, it should clear PG_isolated and
> > PG_movable of the page for free preparation then release the
> > reference of the page to free.
> > 
> > 4. If migration fails, putback function of subsystem should
> > clear PG_isolated via ClearPageIsolated.
> 
> I think that this feature needs a separate document to describe
> requirement of each step in more detail. For example, #1 can be
> possible without holding a lock? I'm not sure because you lock
> the page when implementing zsmalloc page migration in 15th patch.

Yes, we needs PG_lock because install page->mapping and PG_movable
should be atomic and PG_lock protects it.

Better interface might be

void __SetPageMovable(struct page *page, sruct address_space *mapping);

> 
> #3 also need more explanation. Before release, we need to
> unregister address_space. I guess that it needs to be done
> in migratepage() but there is no explanation.

Okay, we can unregister address_space in __ClearPageMovable.
I will change it.

> 
> > 
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  Documentation/filesystems/Locking      |   4 +
> >  Documentation/filesystems/vfs.txt      |   5 ++
> >  fs/proc/page.c                         |   3 +
> >  include/linux/fs.h                     |   2 +
> >  include/linux/migrate.h                |   2 +
> >  include/linux/page-flags.h             |  29 ++++++++
> >  include/uapi/linux/kernel-page-flags.h |   1 +
> >  mm/compaction.c                        |  14 +++-
> >  mm/migrate.c                           | 132 +++++++++++++++++++++++++++++----
> >  9 files changed, 177 insertions(+), 15 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > index 619af9bfdcb3..0bb79560abb3 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -195,7 +195,9 @@ unlocks and drops the reference.
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> > +	void (*putback_page) (struct page *);
> >  	int (*launder_page)(struct page *);
> >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> >  	int (*error_remove_page)(struct address_space *, struct page *);
> > @@ -219,7 +221,9 @@ invalidatepage:		yes
> >  releasepage:		yes
> >  freepage:		yes
> >  direct_IO:
> > +isolate_page:		yes
> >  migratepage:		yes (both)
> > +putback_page:		yes
> >  launder_page:		yes
> >  is_partially_uptodate:	yes
> >  error_remove_page:	yes
> > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> > index b02a7d598258..4c1b6c3b4bc8 100644
> > --- a/Documentation/filesystems/vfs.txt
> > +++ b/Documentation/filesystems/vfs.txt
> > @@ -592,9 +592,14 @@ struct address_space_operations {
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > +	/* isolate a page for migration */
> > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	/* migrate the contents of a page to the specified target */
> >  	int (*migratepage) (struct page *, struct page *);
> > +	/* put the page back to right list */
> > +	void (*putback_page) (struct page *);
> >  	int (*launder_page) (struct page *);
> > +
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > index 712f1b9992cc..e2066e73a9b8 100644
> > --- a/fs/proc/page.c
> > +++ b/fs/proc/page.c
> > @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> >  	if (page_is_idle(page))
> >  		u |= 1 << KPF_IDLE;
> >  
> > +	if (PageMovable(page))
> > +		u |= 1 << KPF_MOVABLE;
> > +
> >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> >  
> >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 14a97194b34b..b7ef2e41fa4a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -401,6 +401,8 @@ struct address_space_operations {
> >  	 */
> >  	int (*migratepage) (struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> > +	bool (*isolate_page)(struct page *, isolate_mode_t);
> > +	void (*putback_page)(struct page *);
> >  	int (*launder_page) (struct page *);
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 9b50325e4ddf..404fbfefeb33 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> >  		unsigned long private, enum migrate_mode mode, int reason);
> > +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> > +extern void putback_movable_page(struct page *page);
> >  
> >  extern int migrate_prep(void);
> >  extern int migrate_prep_local(void);
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index f4ed4f1b0c77..3885064641c4 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -129,6 +129,10 @@ enum pageflags {
> >  
> >  	/* Compound pages. Stored in first tail page's flags */
> >  	PG_double_map = PG_private_2,
> > +
> > +	/* non-lru movable pages */
> > +	PG_movable = PG_reclaim,
> > +	PG_isolated = PG_owner_priv_1,
> >  };
> >  
> >  #ifndef __GENERATING_BOUNDS_H
> > @@ -614,6 +618,31 @@ static inline void __ClearPageBalloon(struct page *page)
> >  	atomic_set(&page->_mapcount, -1);
> >  }
> >  
> > +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> > +
> > +static inline int PageMovable(struct page *page)
> > +{
> > +	return ((test_bit(PG_movable, &(page)->flags) &&
> > +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> > +		|| PageBalloon(page));
> > +}
> > +
> > +/*
> > + * Caller should hold a PG_lock */
> > +static inline void __SetPageMovable(struct page *page)
> > +{
> > +	__set_bit(PG_movable, &page->flags);
> > +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> > +}
> 
> I think there is no big benefit to use non-atomic version here.
> PageMovable() is speculatively checked without holding a PG_lock
> so some cpu can miss this flag set if we use non-atomic version.

I wanted to show that double underscore is non-atomic so caller
should take care of the lock(i.e., PG_lock).
If we use atomic version, what kinds of benefit do we have?
Without holding PG_lock, atomic version could be raced, too.

> 
> > +
> > +static inline void __ClearPageMovable(struct page *page)
> > +{
> > +	atomic_set(&page->_mapcount, -1);
> > +	__clear_bit(PG_movable, &(page)->flags);
> > +}
> > +
> > +PAGEFLAG(Isolated, isolated, PF_ANY);
> > +
> >  /*
> >   * If network-based swap is enabled, sl*b must keep track of whether pages
> >   * were allocated from pfmemalloc reserves.
> > diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> > index 5da5f8751ce7..a184fd2434fa 100644
> > --- a/include/uapi/linux/kernel-page-flags.h
> > +++ b/include/uapi/linux/kernel-page-flags.h
> > @@ -34,6 +34,7 @@
> >  #define KPF_BALLOON		23
> >  #define KPF_ZERO_PAGE		24
> >  #define KPF_IDLE		25
> > +#define KPF_MOVABLE		26
> >  
> >  
> >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index ccf97b02b85f..7557aedddaee 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  
> >  		/*
> >  		 * Check may be lockless but that's ok as we recheck later.
> > -		 * It's possible to migrate LRU pages and balloon pages
> > +		 * It's possible to migrate LRU and movable kernel pages.
> >  		 * Skip any other type of page
> >  		 */
> >  		is_lru = PageLRU(page);
> > @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  					goto isolate_success;
> >  				}
> >  			}
> > +
> > +			if (unlikely(PageMovable(page)) &&
> > +					!PageIsolated(page)) {
> > +				if (locked) {
> > +					spin_unlock_irqrestore(&zone->lru_lock,
> > +									flags);
> > +					locked = false;
> > +				}
> > +
> > +				if (isolate_movable_page(page, isolate_mode))
> > +					goto isolate_success;
> > +			}
> >  		}
> >  
> >  		/*
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index b65c84267ce0..fc2842a15807 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -73,6 +73,75 @@ int migrate_prep_local(void)
> >  	return 0;
> >  }
> >  
> > +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> > +{
> > +	bool ret = false;
> > +
> > +	/*
> > +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> > +	 * or just got freed under us.
> > +	 *
> > +	 * In case we 'win' a race for a movable page being freed under us and
> > +	 * raise its refcount preventing __free_pages() from doing its job
> > +	 * the put_page() at the end of this block will take care of
> > +	 * release this page, thus avoiding a nasty leakage.
> > +	 */
> > +	if (unlikely(!get_page_unless_zero(page)))
> > +		goto out;
> 
> After getting the ref counter, we need to re-check PageMovable()
> to ensure that we indeed handle PageMovable() type page. Without it,
> the page we handle can be freed and re-allocated to someone else
> that isn't related to PageMovable() before grabbing the page. Trying
> trylock_page() in this case could cause a problem.

I don't get it. Why do you think trylock_page could cause a problem?
Could you elaborate it more?

> 
> > +	/*
> > +	 * As movable pages are not isolated from LRU lists, concurrent
> > +	 * compaction threads can race against page migration functions
> > +	 * as well as race against the releasing a page.
> > +	 *
> > +	 * In order to avoid having an already isolated movable page
> > +	 * being (wrongly) re-isolated while it is under migration,
> > +	 * or to avoid attempting to isolate pages being released,
> > +	 * lets be sure we have the page lock
> > +	 * before proceeding with the movable page isolation steps.
> > +	 */
> > +	if (unlikely(!trylock_page(page)))
> > +		goto out_putpage;
> > +
> > +	if (!PageMovable(page) || PageIsolated(page))
> > +		goto out_no_isolated;
> > +
> > +	ret = page->mapping->a_ops->isolate_page(page, mode);
> > +	if (!ret)
> > +		goto out_no_isolated;
> > +
> > +	WARN_ON_ONCE(!PageIsolated(page));
> > +	unlock_page(page);
> > +	return ret;
> > +
> > +out_no_isolated:
> > +	unlock_page(page);
> > +out_putpage:
> > +	put_page(page);
> > +out:
> > +	return ret;
> > +}
> > +
> > +void putback_movable_page(struct page *page)
> > +{
> > +	struct address_space *mapping;
> > +
> > +	/*
> > +	 * 'lock_page()' stabilizes the page and prevents races against
> > +	 * concurrent isolation threads attempting to re-isolate it.
> > +	 */
> > +	lock_page(page);
> > +	mapping = page_mapping(page);
> > +	if (mapping) {
> > +		mapping->a_ops->putback_page(page);
> > +		WARN_ON_ONCE(PageIsolated(page));
> > +	}
> > +	unlock_page(page);
> > +	/* drop the extra ref count taken for movable page isolation */
> > +	put_page(page);
> > +}
> 
> This is complicated part for me. mapping can disappear? In this case,
> who clear PageIsolated()?

Page's owner, for exmaple, zsmalloc, virtio-balloon.
They can free page whenever they want once it holds a PG_lock.
They should clear mapping and PG_movable with PG_lock.

> 
> > +
> > +
> >  /*
> >   * Put previously isolated pages back onto the appropriate lists
> >   * from where they were once taken off for compaction/migration.
> > @@ -96,6 +165,8 @@ void putback_movable_pages(struct list_head *l)
> >  				page_is_file_cache(page));
> >  		if (unlikely(isolated_balloon_page(page)))
> >  			balloon_page_putback(page);
> > +		else if (unlikely(PageIsolated(page)))
> > +			putback_movable_page(page);
> >  		else
> >  			putback_lru_page(page);
> >  	}
> 
> I think that this will not work. You uses PG_owner_priv_1 as
> PG_isolated and it is possible that some lru pages has this flag.
> I guess you need to add PageMovable() check but it seems that mapping
> and this flag can be cleared by others.

Hmm, PageMovable check may work because If PageMovable check fails,
it means page's owner free the page so we can simple put the page to
release refcount in here.
I will check it.

> 
> > @@ -592,7 +663,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
> >   ***********************************************************/
> >  
> >  /*
> > - * Common logic to directly migrate a single page suitable for
> > + * Common logic to directly migrate a single LRU page suitable for
> >   * pages that do not use PagePrivate/PagePrivate2.
> >   *
> >   * Pages are locked upon entry and exit.
> > @@ -755,24 +826,53 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  				enum migrate_mode mode)
> >  {
> >  	struct address_space *mapping;
> > -	int rc;
> > +	int rc = -EAGAIN;
> > +	bool isolated_lru_page;
> >  
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> >  
> >  	mapping = page_mapping(page);
> > -	if (!mapping)
> > -		rc = migrate_page(mapping, newpage, page, mode);
> > -	else if (mapping->a_ops->migratepage)
> > +	/*
> > +	 * In case of non-lru page, it could be released after
> > +	 * isolation step. In that case, we shouldn't try
> > +	 * fallback migration which was designed for LRU pages.
> 
> So, page we try to migrate can be released during migration.
> In this case, who does clear mapping and flag? And, without mapping,
> how we can clear PageIsolated()?

As I wrote down in description, it's role of user.

> 
> > +	 * To identify such pages, we cannot use PageMovable
> > +	 * because owner of the page can reset it. So intead,
> > +	 * use PG_isolated bit.
> > +	 */
> > +	isolated_lru_page = !PageIsolated(page);
> 
> Ditto. PageIsolated() isn't sufficient to distinguish non-lru page.

Okay, I will see it.

> 
> My comment mainly points out rules about when/who clear mapping
> and flag. Maybe, you need to answer just one of them. :)

I agree document is really lack of information at the moment.
I will put more words.

Thanks for the review, Joonsoo!

^ permalink raw reply

* [PATCH v2] drm/virtio: send vblank event after crtc updates
From: Gustavo Padovan @ 2016-03-22 20:17 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: David Airlie, Gustavo Padovan, open list,
	open list:VIRTIO GPU DRIVER, open list:VIRTIO GPU DRIVER
In-Reply-To: <1458588187-26002-2-git-send-email-gustavo@padovan.org>

From: Gustavo Padovan <gustavo.padovan@collabora.co.uk>

virtio_gpu was failing to send vblank events when using the atomic IOCTL
with the DRM_MODE_PAGE_FLIP_EVENT flag set. This patch fixes each and
enables atomic pageflips updates.

Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
---
 drivers/gpu/drm/virtio/virtgpu_display.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index b70bb8b..4f372e0 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -274,12 +274,24 @@ static int virtio_gpu_crtc_atomic_check(struct drm_crtc *crtc,
 	return 0;
 }
 
+static void virtio_gpu_crtc_atomic_flush(struct drm_crtc *crtc,
+					 struct drm_crtc_state *old_state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&crtc->dev->event_lock, flags);
+	if (crtc->state->event)
+		drm_crtc_send_vblank_event(crtc, crtc->state->event);
+	spin_unlock_irqrestore(&crtc->dev->event_lock, flags);
+}
+
 static const struct drm_crtc_helper_funcs virtio_gpu_crtc_helper_funcs = {
 	.enable        = virtio_gpu_crtc_enable,
 	.disable       = virtio_gpu_crtc_disable,
 	.mode_fixup    = virtio_gpu_crtc_mode_fixup,
 	.mode_set_nofb = virtio_gpu_crtc_mode_set_nofb,
 	.atomic_check  = virtio_gpu_crtc_atomic_check,
+	.atomic_flush  = virtio_gpu_crtc_atomic_flush,
 };
 
 static void virtio_gpu_enc_mode_set(struct drm_encoder *encoder,
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH v2 13/18] mm/compaction: support non-lru movable page migration
From: Joonsoo Kim @ 2016-03-23  5:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, dri-devel, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, koct9i,
	Sangseok Lee, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160322145545.GB3221@bbox>

On Tue, Mar 22, 2016 at 11:55:45PM +0900, Minchan Kim wrote:
> On Tue, Mar 22, 2016 at 02:50:37PM +0900, Joonsoo Kim wrote:
> > On Mon, Mar 21, 2016 at 03:31:02PM +0900, Minchan Kim wrote:
> > > We have allowed migration for only LRU pages until now and it was
> > > enough to make high-order pages. But recently, embedded system(e.g.,
> > > webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> > > so we have seen several reports about troubles of small high-order
> > > allocation. For fixing the problem, there were several efforts
> > > (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> > > reserved memory, vmalloc and so on) but if there are lots of
> > > non-movable pages in system, their solutions are void in the long run.
> > > 
> > > So, this patch is to support facility to change non-movable pages
> > > with movable. For the feature, this patch introduces functions related
> > > to migration to address_space_operations as well as some page flags.
> > > 
> > > Basically, this patch supports two page-flags and two functions related
> > > to page migration. The flag and page->mapping stability are protected
> > > by PG_lock.
> > > 
> > > 	PG_movable
> > > 	PG_isolated
> > > 
> > > 	bool (*isolate_page) (struct page *, isolate_mode_t);
> > > 	void (*putback_page) (struct page *);
> > > 
> > > Duty of subsystem want to make their pages as migratable are
> > > as follows:
> > > 
> > > 1. It should register address_space to page->mapping then mark
> > > the page as PG_movable via __SetPageMovable.
> > > 
> > > 2. It should mark the page as PG_isolated via SetPageIsolated
> > > if isolation is sucessful and return true.
> > > 
> > > 3. If migration is successful, it should clear PG_isolated and
> > > PG_movable of the page for free preparation then release the
> > > reference of the page to free.
> > > 
> > > 4. If migration fails, putback function of subsystem should
> > > clear PG_isolated via ClearPageIsolated.
> > 
> > I think that this feature needs a separate document to describe
> > requirement of each step in more detail. For example, #1 can be
> > possible without holding a lock? I'm not sure because you lock
> > the page when implementing zsmalloc page migration in 15th patch.
> 
> Yes, we needs PG_lock because install page->mapping and PG_movable
> should be atomic and PG_lock protects it.
> 
> Better interface might be
> 
> void __SetPageMovable(struct page *page, sruct address_space *mapping);
> 
> > 
> > #3 also need more explanation. Before release, we need to
> > unregister address_space. I guess that it needs to be done
> > in migratepage() but there is no explanation.
> 
> Okay, we can unregister address_space in __ClearPageMovable.
> I will change it.
> 
> > 
> > > 
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: virtualization@lists.linux-foundation.org
> > > Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  Documentation/filesystems/Locking      |   4 +
> > >  Documentation/filesystems/vfs.txt      |   5 ++
> > >  fs/proc/page.c                         |   3 +
> > >  include/linux/fs.h                     |   2 +
> > >  include/linux/migrate.h                |   2 +
> > >  include/linux/page-flags.h             |  29 ++++++++
> > >  include/uapi/linux/kernel-page-flags.h |   1 +
> > >  mm/compaction.c                        |  14 +++-
> > >  mm/migrate.c                           | 132 +++++++++++++++++++++++++++++----
> > >  9 files changed, 177 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > > index 619af9bfdcb3..0bb79560abb3 100644
> > > --- a/Documentation/filesystems/Locking
> > > +++ b/Documentation/filesystems/Locking
> > > @@ -195,7 +195,9 @@ unlocks and drops the reference.
> > >  	int (*releasepage) (struct page *, int);
> > >  	void (*freepage)(struct page *);
> > >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> > >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> > > +	void (*putback_page) (struct page *);
> > >  	int (*launder_page)(struct page *);
> > >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> > >  	int (*error_remove_page)(struct address_space *, struct page *);
> > > @@ -219,7 +221,9 @@ invalidatepage:		yes
> > >  releasepage:		yes
> > >  freepage:		yes
> > >  direct_IO:
> > > +isolate_page:		yes
> > >  migratepage:		yes (both)
> > > +putback_page:		yes
> > >  launder_page:		yes
> > >  is_partially_uptodate:	yes
> > >  error_remove_page:	yes
> > > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> > > index b02a7d598258..4c1b6c3b4bc8 100644
> > > --- a/Documentation/filesystems/vfs.txt
> > > +++ b/Documentation/filesystems/vfs.txt
> > > @@ -592,9 +592,14 @@ struct address_space_operations {
> > >  	int (*releasepage) (struct page *, int);
> > >  	void (*freepage)(struct page *);
> > >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > > +	/* isolate a page for migration */
> > > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> > >  	/* migrate the contents of a page to the specified target */
> > >  	int (*migratepage) (struct page *, struct page *);
> > > +	/* put the page back to right list */
> > > +	void (*putback_page) (struct page *);
> > >  	int (*launder_page) (struct page *);
> > > +
> > >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> > >  					unsigned long);
> > >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> > > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > > index 712f1b9992cc..e2066e73a9b8 100644
> > > --- a/fs/proc/page.c
> > > +++ b/fs/proc/page.c
> > > @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> > >  	if (page_is_idle(page))
> > >  		u |= 1 << KPF_IDLE;
> > >  
> > > +	if (PageMovable(page))
> > > +		u |= 1 << KPF_MOVABLE;
> > > +
> > >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> > >  
> > >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 14a97194b34b..b7ef2e41fa4a 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -401,6 +401,8 @@ struct address_space_operations {
> > >  	 */
> > >  	int (*migratepage) (struct address_space *,
> > >  			struct page *, struct page *, enum migrate_mode);
> > > +	bool (*isolate_page)(struct page *, isolate_mode_t);
> > > +	void (*putback_page)(struct page *);
> > >  	int (*launder_page) (struct page *);
> > >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> > >  					unsigned long);
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 9b50325e4ddf..404fbfefeb33 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> > >  			struct page *, struct page *, enum migrate_mode);
> > >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> > >  		unsigned long private, enum migrate_mode mode, int reason);
> > > +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> > > +extern void putback_movable_page(struct page *page);
> > >  
> > >  extern int migrate_prep(void);
> > >  extern int migrate_prep_local(void);
> > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > index f4ed4f1b0c77..3885064641c4 100644
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -129,6 +129,10 @@ enum pageflags {
> > >  
> > >  	/* Compound pages. Stored in first tail page's flags */
> > >  	PG_double_map = PG_private_2,
> > > +
> > > +	/* non-lru movable pages */
> > > +	PG_movable = PG_reclaim,
> > > +	PG_isolated = PG_owner_priv_1,
> > >  };
> > >  
> > >  #ifndef __GENERATING_BOUNDS_H
> > > @@ -614,6 +618,31 @@ static inline void __ClearPageBalloon(struct page *page)
> > >  	atomic_set(&page->_mapcount, -1);
> > >  }
> > >  
> > > +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> > > +
> > > +static inline int PageMovable(struct page *page)
> > > +{
> > > +	return ((test_bit(PG_movable, &(page)->flags) &&
> > > +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> > > +		|| PageBalloon(page));
> > > +}
> > > +
> > > +/*
> > > + * Caller should hold a PG_lock */
> > > +static inline void __SetPageMovable(struct page *page)
> > > +{
> > > +	__set_bit(PG_movable, &page->flags);
> > > +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> > > +}
> > 
> > I think there is no big benefit to use non-atomic version here.
> > PageMovable() is speculatively checked without holding a PG_lock
> > so some cpu can miss this flag set if we use non-atomic version.
> 
> I wanted to show that double underscore is non-atomic so caller
> should take care of the lock(i.e., PG_lock).
> If we use atomic version, what kinds of benefit do we have?
> Without holding PG_lock, atomic version could be raced, too.

My suggestion is holding PG_lock + atomic set. Compaction first
checks PageMovable() without PG_lock so it can miss PageMovable() if
non-atomic version is used.

> 
> > 
> > > +
> > > +static inline void __ClearPageMovable(struct page *page)
> > > +{
> > > +	atomic_set(&page->_mapcount, -1);
> > > +	__clear_bit(PG_movable, &(page)->flags);
> > > +}
> > > +
> > > +PAGEFLAG(Isolated, isolated, PF_ANY);
> > > +
> > >  /*
> > >   * If network-based swap is enabled, sl*b must keep track of whether pages
> > >   * were allocated from pfmemalloc reserves.
> > > diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> > > index 5da5f8751ce7..a184fd2434fa 100644
> > > --- a/include/uapi/linux/kernel-page-flags.h
> > > +++ b/include/uapi/linux/kernel-page-flags.h
> > > @@ -34,6 +34,7 @@
> > >  #define KPF_BALLOON		23
> > >  #define KPF_ZERO_PAGE		24
> > >  #define KPF_IDLE		25
> > > +#define KPF_MOVABLE		26
> > >  
> > >  
> > >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index ccf97b02b85f..7557aedddaee 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > >  
> > >  		/*
> > >  		 * Check may be lockless but that's ok as we recheck later.
> > > -		 * It's possible to migrate LRU pages and balloon pages
> > > +		 * It's possible to migrate LRU and movable kernel pages.
> > >  		 * Skip any other type of page
> > >  		 */
> > >  		is_lru = PageLRU(page);
> > > @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > >  					goto isolate_success;
> > >  				}
> > >  			}
> > > +
> > > +			if (unlikely(PageMovable(page)) &&
> > > +					!PageIsolated(page)) {
> > > +				if (locked) {
> > > +					spin_unlock_irqrestore(&zone->lru_lock,
> > > +									flags);
> > > +					locked = false;
> > > +				}
> > > +
> > > +				if (isolate_movable_page(page, isolate_mode))
> > > +					goto isolate_success;
> > > +			}
> > >  		}
> > >  
> > >  		/*
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index b65c84267ce0..fc2842a15807 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -73,6 +73,75 @@ int migrate_prep_local(void)
> > >  	return 0;
> > >  }
> > >  
> > > +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> > > +{
> > > +	bool ret = false;
> > > +
> > > +	/*
> > > +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> > > +	 * or just got freed under us.
> > > +	 *
> > > +	 * In case we 'win' a race for a movable page being freed under us and
> > > +	 * raise its refcount preventing __free_pages() from doing its job
> > > +	 * the put_page() at the end of this block will take care of
> > > +	 * release this page, thus avoiding a nasty leakage.
> > > +	 */
> > > +	if (unlikely(!get_page_unless_zero(page)))
> > > +		goto out;
> > 
> > After getting the ref counter, we need to re-check PageMovable()
> > to ensure that we indeed handle PageMovable() type page. Without it,
> > the page we handle can be freed and re-allocated to someone else
> > that isn't related to PageMovable() before grabbing the page. Trying
> > trylock_page() in this case could cause a problem.
> 
> I don't get it. Why do you think trylock_page could cause a problem?
> Could you elaborate it more?

Okay. Consider following sequence.

CPU-A                               CPU-B
check PageMovable() in compacton
...                            free the page
...                            allocate the page for other usecase
...                            (maybe for file cache or slub)
get unless 0 in isolate_movable_page()
trylock (success)
                               (try) lock! failed!

In this case, someone can see failure even if they are owner of the
page. IIUC, this also can happen in zsmalloc. See init_zspage() in
15th patch. It assume that allocated page can be locked
unconditionally.

> > 
> > > +	/*
> > > +	 * As movable pages are not isolated from LRU lists, concurrent
> > > +	 * compaction threads can race against page migration functions
> > > +	 * as well as race against the releasing a page.
> > > +	 *
> > > +	 * In order to avoid having an already isolated movable page
> > > +	 * being (wrongly) re-isolated while it is under migration,
> > > +	 * or to avoid attempting to isolate pages being released,
> > > +	 * lets be sure we have the page lock
> > > +	 * before proceeding with the movable page isolation steps.
> > > +	 */
> > > +	if (unlikely(!trylock_page(page)))
> > > +		goto out_putpage;
> > > +
> > > +	if (!PageMovable(page) || PageIsolated(page))
> > > +		goto out_no_isolated;
> > > +
> > > +	ret = page->mapping->a_ops->isolate_page(page, mode);
> > > +	if (!ret)
> > > +		goto out_no_isolated;
> > > +
> > > +	WARN_ON_ONCE(!PageIsolated(page));
> > > +	unlock_page(page);
> > > +	return ret;
> > > +
> > > +out_no_isolated:
> > > +	unlock_page(page);
> > > +out_putpage:
> > > +	put_page(page);
> > > +out:
> > > +	return ret;
> > > +}
> > > +
> > > +void putback_movable_page(struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +
> > > +	/*
> > > +	 * 'lock_page()' stabilizes the page and prevents races against
> > > +	 * concurrent isolation threads attempting to re-isolate it.
> > > +	 */
> > > +	lock_page(page);
> > > +	mapping = page_mapping(page);
> > > +	if (mapping) {
> > > +		mapping->a_ops->putback_page(page);
> > > +		WARN_ON_ONCE(PageIsolated(page));
> > > +	}
> > > +	unlock_page(page);
> > > +	/* drop the extra ref count taken for movable page isolation */
> > > +	put_page(page);
> > > +}
> > 
> > This is complicated part for me. mapping can disappear? In this case,
> > who clear PageIsolated()?
> 
> Page's owner, for exmaple, zsmalloc, virtio-balloon.
> They can free page whenever they want once it holds a PG_lock.
> They should clear mapping and PG_movable with PG_lock.
> 
> > 
> > > +
> > > +
> > >  /*
> > >   * Put previously isolated pages back onto the appropriate lists
> > >   * from where they were once taken off for compaction/migration.
> > > @@ -96,6 +165,8 @@ void putback_movable_pages(struct list_head *l)
> > >  				page_is_file_cache(page));
> > >  		if (unlikely(isolated_balloon_page(page)))
> > >  			balloon_page_putback(page);
> > > +		else if (unlikely(PageIsolated(page)))
> > > +			putback_movable_page(page);
> > >  		else
> > >  			putback_lru_page(page);
> > >  	}
> > 
> > I think that this will not work. You uses PG_owner_priv_1 as
> > PG_isolated and it is possible that some lru pages has this flag.
> > I guess you need to add PageMovable() check but it seems that mapping
> > and this flag can be cleared by others.
> 
> Hmm, PageMovable check may work because If PageMovable check fails,
> it means page's owner free the page so we can simple put the page to
> release refcount in here.
> I will check it.

Hmmm... But, in failure case, is it safe to call putback_lru_page() for them?
And, PageIsolated() would be left. Is it okay? It's not symmetric that
isolated page can be freed by decreasing ref count without calling
putback function. This should be clarified and documented.

Thanks.

^ permalink raw reply

* RE: Re: [PATCH v2 13/18] mm/compaction: support non-lru movable pagemigration
From: Gioh Kim @ 2016-03-23 20:26 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, Sergey Senozhatsky, YiPing Xu, aquini, rknize,
	linux-mm, Chan Gyun Jeong, dri-devel, Hugh Dickins, linux-kernel,
	Al Viro, virtualization, bfields, Minchan Kim, Gioh Kim, koct9i,
	Sangseok Lee, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman

[-- Attachment #1: Type: text/html, Size: 2502 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v2 13/18] mm/compaction: support non-lru movable page migration
From: Minchan Kim @ 2016-03-24  5:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, dri-devel, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, koct9i,
	Sangseok Lee, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160323050511.GD4624@js1304-P5Q-DELUXE>

On Wed, Mar 23, 2016 at 02:05:11PM +0900, Joonsoo Kim wrote:
> On Tue, Mar 22, 2016 at 11:55:45PM +0900, Minchan Kim wrote:
> > On Tue, Mar 22, 2016 at 02:50:37PM +0900, Joonsoo Kim wrote:
> > > On Mon, Mar 21, 2016 at 03:31:02PM +0900, Minchan Kim wrote:
> > > > We have allowed migration for only LRU pages until now and it was
> > > > enough to make high-order pages. But recently, embedded system(e.g.,
> > > > webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> > > > so we have seen several reports about troubles of small high-order
> > > > allocation. For fixing the problem, there were several efforts
> > > > (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> > > > reserved memory, vmalloc and so on) but if there are lots of
> > > > non-movable pages in system, their solutions are void in the long run.
> > > > 
> > > > So, this patch is to support facility to change non-movable pages
> > > > with movable. For the feature, this patch introduces functions related
> > > > to migration to address_space_operations as well as some page flags.
> > > > 
> > > > Basically, this patch supports two page-flags and two functions related
> > > > to page migration. The flag and page->mapping stability are protected
> > > > by PG_lock.
> > > > 
> > > > 	PG_movable
> > > > 	PG_isolated
> > > > 
> > > > 	bool (*isolate_page) (struct page *, isolate_mode_t);
> > > > 	void (*putback_page) (struct page *);
> > > > 
> > > > Duty of subsystem want to make their pages as migratable are
> > > > as follows:
> > > > 
> > > > 1. It should register address_space to page->mapping then mark
> > > > the page as PG_movable via __SetPageMovable.
> > > > 
> > > > 2. It should mark the page as PG_isolated via SetPageIsolated
> > > > if isolation is sucessful and return true.
> > > > 
> > > > 3. If migration is successful, it should clear PG_isolated and
> > > > PG_movable of the page for free preparation then release the
> > > > reference of the page to free.
> > > > 
> > > > 4. If migration fails, putback function of subsystem should
> > > > clear PG_isolated via ClearPageIsolated.
> > > 
> > > I think that this feature needs a separate document to describe
> > > requirement of each step in more detail. For example, #1 can be
> > > possible without holding a lock? I'm not sure because you lock
> > > the page when implementing zsmalloc page migration in 15th patch.
> > 
> > Yes, we needs PG_lock because install page->mapping and PG_movable
> > should be atomic and PG_lock protects it.
> > 
> > Better interface might be
> > 
> > void __SetPageMovable(struct page *page, sruct address_space *mapping);
> > 
> > > 
> > > #3 also need more explanation. Before release, we need to
> > > unregister address_space. I guess that it needs to be done
> > > in migratepage() but there is no explanation.
> > 
> > Okay, we can unregister address_space in __ClearPageMovable.
> > I will change it.
> > 
> > > 
> > > > 
> > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > Cc: Mel Gorman <mgorman@suse.de>
> > > > Cc: Hugh Dickins <hughd@google.com>
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: virtualization@lists.linux-foundation.org
> > > > Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  Documentation/filesystems/Locking      |   4 +
> > > >  Documentation/filesystems/vfs.txt      |   5 ++
> > > >  fs/proc/page.c                         |   3 +
> > > >  include/linux/fs.h                     |   2 +
> > > >  include/linux/migrate.h                |   2 +
> > > >  include/linux/page-flags.h             |  29 ++++++++
> > > >  include/uapi/linux/kernel-page-flags.h |   1 +
> > > >  mm/compaction.c                        |  14 +++-
> > > >  mm/migrate.c                           | 132 +++++++++++++++++++++++++++++----
> > > >  9 files changed, 177 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > > > index 619af9bfdcb3..0bb79560abb3 100644
> > > > --- a/Documentation/filesystems/Locking
> > > > +++ b/Documentation/filesystems/Locking
> > > > @@ -195,7 +195,9 @@ unlocks and drops the reference.
> > > >  	int (*releasepage) (struct page *, int);
> > > >  	void (*freepage)(struct page *);
> > > >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > > > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> > > >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> > > > +	void (*putback_page) (struct page *);
> > > >  	int (*launder_page)(struct page *);
> > > >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> > > >  	int (*error_remove_page)(struct address_space *, struct page *);
> > > > @@ -219,7 +221,9 @@ invalidatepage:		yes
> > > >  releasepage:		yes
> > > >  freepage:		yes
> > > >  direct_IO:
> > > > +isolate_page:		yes
> > > >  migratepage:		yes (both)
> > > > +putback_page:		yes
> > > >  launder_page:		yes
> > > >  is_partially_uptodate:	yes
> > > >  error_remove_page:	yes
> > > > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> > > > index b02a7d598258..4c1b6c3b4bc8 100644
> > > > --- a/Documentation/filesystems/vfs.txt
> > > > +++ b/Documentation/filesystems/vfs.txt
> > > > @@ -592,9 +592,14 @@ struct address_space_operations {
> > > >  	int (*releasepage) (struct page *, int);
> > > >  	void (*freepage)(struct page *);
> > > >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> > > > +	/* isolate a page for migration */
> > > > +	bool (*isolate_page) (struct page *, isolate_mode_t);
> > > >  	/* migrate the contents of a page to the specified target */
> > > >  	int (*migratepage) (struct page *, struct page *);
> > > > +	/* put the page back to right list */
> > > > +	void (*putback_page) (struct page *);
> > > >  	int (*launder_page) (struct page *);
> > > > +
> > > >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> > > >  					unsigned long);
> > > >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> > > > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > > > index 712f1b9992cc..e2066e73a9b8 100644
> > > > --- a/fs/proc/page.c
> > > > +++ b/fs/proc/page.c
> > > > @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> > > >  	if (page_is_idle(page))
> > > >  		u |= 1 << KPF_IDLE;
> > > >  
> > > > +	if (PageMovable(page))
> > > > +		u |= 1 << KPF_MOVABLE;
> > > > +
> > > >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> > > >  
> > > >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 14a97194b34b..b7ef2e41fa4a 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -401,6 +401,8 @@ struct address_space_operations {
> > > >  	 */
> > > >  	int (*migratepage) (struct address_space *,
> > > >  			struct page *, struct page *, enum migrate_mode);
> > > > +	bool (*isolate_page)(struct page *, isolate_mode_t);
> > > > +	void (*putback_page)(struct page *);
> > > >  	int (*launder_page) (struct page *);
> > > >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> > > >  					unsigned long);
> > > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > > index 9b50325e4ddf..404fbfefeb33 100644
> > > > --- a/include/linux/migrate.h
> > > > +++ b/include/linux/migrate.h
> > > > @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> > > >  			struct page *, struct page *, enum migrate_mode);
> > > >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> > > >  		unsigned long private, enum migrate_mode mode, int reason);
> > > > +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> > > > +extern void putback_movable_page(struct page *page);
> > > >  
> > > >  extern int migrate_prep(void);
> > > >  extern int migrate_prep_local(void);
> > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > index f4ed4f1b0c77..3885064641c4 100644
> > > > --- a/include/linux/page-flags.h
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -129,6 +129,10 @@ enum pageflags {
> > > >  
> > > >  	/* Compound pages. Stored in first tail page's flags */
> > > >  	PG_double_map = PG_private_2,
> > > > +
> > > > +	/* non-lru movable pages */
> > > > +	PG_movable = PG_reclaim,
> > > > +	PG_isolated = PG_owner_priv_1,
> > > >  };
> > > >  
> > > >  #ifndef __GENERATING_BOUNDS_H
> > > > @@ -614,6 +618,31 @@ static inline void __ClearPageBalloon(struct page *page)
> > > >  	atomic_set(&page->_mapcount, -1);
> > > >  }
> > > >  
> > > > +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> > > > +
> > > > +static inline int PageMovable(struct page *page)
> > > > +{
> > > > +	return ((test_bit(PG_movable, &(page)->flags) &&
> > > > +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> > > > +		|| PageBalloon(page));
> > > > +}
> > > > +
> > > > +/*
> > > > + * Caller should hold a PG_lock */
> > > > +static inline void __SetPageMovable(struct page *page)
> > > > +{
> > > > +	__set_bit(PG_movable, &page->flags);
> > > > +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> > > > +}
> > > 
> > > I think there is no big benefit to use non-atomic version here.
> > > PageMovable() is speculatively checked without holding a PG_lock
> > > so some cpu can miss this flag set if we use non-atomic version.
> > 
> > I wanted to show that double underscore is non-atomic so caller
> > should take care of the lock(i.e., PG_lock).
> > If we use atomic version, what kinds of benefit do we have?
> > Without holding PG_lock, atomic version could be raced, too.
> 
> My suggestion is holding PG_lock + atomic set. Compaction first
> checks PageMovable() without PG_lock so it can miss PageMovable() if
> non-atomic version is used.

I think it's really unlikely to race so the chance it can miss PageMovable
is really small so we don't need to add unncessary overhead caused by
atomic op in PG_movable setting side. As well, I want to use double
understcore to show the intention it's not atomic so user should take
care of the function.

> 
> > 
> > > 
> > > > +
> > > > +static inline void __ClearPageMovable(struct page *page)
> > > > +{
> > > > +	atomic_set(&page->_mapcount, -1);
> > > > +	__clear_bit(PG_movable, &(page)->flags);
> > > > +}
> > > > +
> > > > +PAGEFLAG(Isolated, isolated, PF_ANY);
> > > > +
> > > >  /*
> > > >   * If network-based swap is enabled, sl*b must keep track of whether pages
> > > >   * were allocated from pfmemalloc reserves.
> > > > diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> > > > index 5da5f8751ce7..a184fd2434fa 100644
> > > > --- a/include/uapi/linux/kernel-page-flags.h
> > > > +++ b/include/uapi/linux/kernel-page-flags.h
> > > > @@ -34,6 +34,7 @@
> > > >  #define KPF_BALLOON		23
> > > >  #define KPF_ZERO_PAGE		24
> > > >  #define KPF_IDLE		25
> > > > +#define KPF_MOVABLE		26
> > > >  
> > > >  
> > > >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> > > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > > index ccf97b02b85f..7557aedddaee 100644
> > > > --- a/mm/compaction.c
> > > > +++ b/mm/compaction.c
> > > > @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > > >  
> > > >  		/*
> > > >  		 * Check may be lockless but that's ok as we recheck later.
> > > > -		 * It's possible to migrate LRU pages and balloon pages
> > > > +		 * It's possible to migrate LRU and movable kernel pages.
> > > >  		 * Skip any other type of page
> > > >  		 */
> > > >  		is_lru = PageLRU(page);
> > > > @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > > >  					goto isolate_success;
> > > >  				}
> > > >  			}
> > > > +
> > > > +			if (unlikely(PageMovable(page)) &&
> > > > +					!PageIsolated(page)) {
> > > > +				if (locked) {
> > > > +					spin_unlock_irqrestore(&zone->lru_lock,
> > > > +									flags);
> > > > +					locked = false;
> > > > +				}
> > > > +
> > > > +				if (isolate_movable_page(page, isolate_mode))
> > > > +					goto isolate_success;
> > > > +			}
> > > >  		}
> > > >  
> > > >  		/*
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index b65c84267ce0..fc2842a15807 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -73,6 +73,75 @@ int migrate_prep_local(void)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> > > > +{
> > > > +	bool ret = false;
> > > > +
> > > > +	/*
> > > > +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> > > > +	 * or just got freed under us.
> > > > +	 *
> > > > +	 * In case we 'win' a race for a movable page being freed under us and
> > > > +	 * raise its refcount preventing __free_pages() from doing its job
> > > > +	 * the put_page() at the end of this block will take care of
> > > > +	 * release this page, thus avoiding a nasty leakage.
> > > > +	 */
> > > > +	if (unlikely(!get_page_unless_zero(page)))
> > > > +		goto out;
> > > 
> > > After getting the ref counter, we need to re-check PageMovable()
> > > to ensure that we indeed handle PageMovable() type page. Without it,
> > > the page we handle can be freed and re-allocated to someone else
> > > that isn't related to PageMovable() before grabbing the page. Trying
> > > trylock_page() in this case could cause a problem.
> > 
> > I don't get it. Why do you think trylock_page could cause a problem?
> > Could you elaborate it more?
> 
> Okay. Consider following sequence.
> 
> CPU-A                               CPU-B
> check PageMovable() in compacton
> ...                            free the page
> ...                            allocate the page for other usecase
> ...                            (maybe for file cache or slub)
> get unless 0 in isolate_movable_page()
> trylock (success)
>                                (try) lock! failed!
> 
> In this case, someone can see failure even if they are owner of the
> page. IIUC, this also can happen in zsmalloc. See init_zspage() in
> 15th patch. It assume that allocated page can be locked
> unconditionally.

Oops, As you pointed out, other places already assumed page owner's
trylock will always succeed. :(
Thanks for the notice.
I will add PageMovable check right before trylock.

> 
> > > 
> > > > +	/*
> > > > +	 * As movable pages are not isolated from LRU lists, concurrent
> > > > +	 * compaction threads can race against page migration functions
> > > > +	 * as well as race against the releasing a page.
> > > > +	 *
> > > > +	 * In order to avoid having an already isolated movable page
> > > > +	 * being (wrongly) re-isolated while it is under migration,
> > > > +	 * or to avoid attempting to isolate pages being released,
> > > > +	 * lets be sure we have the page lock
> > > > +	 * before proceeding with the movable page isolation steps.
> > > > +	 */
> > > > +	if (unlikely(!trylock_page(page)))
> > > > +		goto out_putpage;
> > > > +
> > > > +	if (!PageMovable(page) || PageIsolated(page))
> > > > +		goto out_no_isolated;
> > > > +
> > > > +	ret = page->mapping->a_ops->isolate_page(page, mode);
> > > > +	if (!ret)
> > > > +		goto out_no_isolated;
> > > > +
> > > > +	WARN_ON_ONCE(!PageIsolated(page));
> > > > +	unlock_page(page);
> > > > +	return ret;
> > > > +
> > > > +out_no_isolated:
> > > > +	unlock_page(page);
> > > > +out_putpage:
> > > > +	put_page(page);
> > > > +out:
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +void putback_movable_page(struct page *page)
> > > > +{
> > > > +	struct address_space *mapping;
> > > > +
> > > > +	/*
> > > > +	 * 'lock_page()' stabilizes the page and prevents races against
> > > > +	 * concurrent isolation threads attempting to re-isolate it.
> > > > +	 */
> > > > +	lock_page(page);
> > > > +	mapping = page_mapping(page);
> > > > +	if (mapping) {
> > > > +		mapping->a_ops->putback_page(page);
> > > > +		WARN_ON_ONCE(PageIsolated(page));
> > > > +	}
> > > > +	unlock_page(page);
> > > > +	/* drop the extra ref count taken for movable page isolation */
> > > > +	put_page(page);
> > > > +}
> > > 
> > > This is complicated part for me. mapping can disappear? In this case,
> > > who clear PageIsolated()?
> > 
> > Page's owner, for exmaple, zsmalloc, virtio-balloon.
> > They can free page whenever they want once it holds a PG_lock.
> > They should clear mapping and PG_movable with PG_lock.
> > 
> > > 
> > > > +
> > > > +
> > > >  /*
> > > >   * Put previously isolated pages back onto the appropriate lists
> > > >   * from where they were once taken off for compaction/migration.
> > > > @@ -96,6 +165,8 @@ void putback_movable_pages(struct list_head *l)
> > > >  				page_is_file_cache(page));
> > > >  		if (unlikely(isolated_balloon_page(page)))
> > > >  			balloon_page_putback(page);
> > > > +		else if (unlikely(PageIsolated(page)))
> > > > +			putback_movable_page(page);
> > > >  		else
> > > >  			putback_lru_page(page);
> > > >  	}
> > > 
> > > I think that this will not work. You uses PG_owner_priv_1 as
> > > PG_isolated and it is possible that some lru pages has this flag.
> > > I guess you need to add PageMovable() check but it seems that mapping
> > > and this flag can be cleared by others.
> > 
> > Hmm, PageMovable check may work because If PageMovable check fails,
> > it means page's owner free the page so we can simple put the page to
> > release refcount in here.
> > I will check it.
> 
> Hmmm... But, in failure case, is it safe to call putback_lru_page() for them?

At the moment, it's safe to work. It is just added to lrupvec to release
although it was not LRU page. Yes, it's error-pronce so someday it hurts
someone. I will think more about it.


> And, PageIsolated() would be left. Is it okay? It's not symmetric that

Page owner should clear PageIsolated once he freed the page.

> isolated page can be freed by decreasing ref count without calling
> putback function. This should be clarified and documented.

Yes, I will add it on document.

Thanks!

^ permalink raw reply

* Re: Re: [PATCH v2 13/18] mm/compaction: support non-lru movable pagemigration
From: Minchan Kim @ 2016-03-24  5:11 UTC (permalink / raw)
  To: Gioh Kim
  Cc: Rik van Riel, YiPing Xu, Andrew Morton, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, dri-devel, Hugh Dickins,
	linux-kernel, Al Viro, virtualization, bfields, linux-mm,
	Mel Gorman, Gioh Kim, koct9i, Sangseok Lee, Joonsoo Kim, jlayton,
	Vlastimil Babka, aquini
In-Reply-To: <20160324052650.HM.e0000000006t8Yn@gurugio.wwl1662.hanmail.net>

On Thu, Mar 24, 2016 at 05:26:50AM +0900, Gioh Kim wrote:
>    Hmmm... But, in failure case, is it safe to call putback_lru_page() for
>    them?
>    And, PageIsolated() would be left. Is it okay? It's not symmetric that
>    isolated page can be freed by decreasing ref count without calling
>    putback function. This should be clarified and documented.
> 
>    I agree Joonsoo's idea.
> 
>    Freeing isolated page out of putback() could be confused.

If we makes such rule, subsystem cannot free the isolated pages until VM calls
putback. I don't think it's a good idea. With it, every users should make own
deferred page freeing logic which might be more error-prone and obstacle for
using this interface.

I want to make client free his pages whenever he want if possible.

> 
>    Every detail cannot be documented. And more documents mean less elegant
>    code.
> 
>    Is it possible to free isolated page in putback()?
> 
>    In move_to_new_page(), can we call a_ops->migratepage like following?
> 
>    move_to_new_page()
> 
>    {
> 
>    mapping = page_mapping(page)
> 
>    if (!mapping)
> 
>        rc = migrate_page
> 
>    else if (mapping->a_ops->migratepage && IsolatePage(page))
> 
>       rc = mapping->a_ops->migratepage
> 

It's not a problem. The problem is that a page failed migration
so VM will putback the page. But, between fail of migration and
putback of isolated page, user can free the page. In this case,
putback operation would be not called and pass the page in
putback_lru_page.


>    else
> 
>        rc = fallback_migrate_page
> 
>    ...
> 
>       return rc
> 
>    }
> 
>    I'm sorry that I couldn't review in detail because I forgot many
>    details.

You're a human being, not Alphago. :)

Thanks for the review, Gioh!

> 
>    [1][Kk8NwEH1.I.q95.FfPs-qw00]
>    [@from=gurugio&rcpt=minchan%40kernel%2Eorg&msgid=%3C20160324052650%2EHM
>    %2Ee0000000006t8Yn%40gurugio%2Ewwl1662%2Ehanmail%2Enet%3E]
> 
> References
> 
>    1. mailto:gurugio@hanmail.net

^ permalink raw reply

* [RFC PATCH V2 0/2] basic device IOTLB support
From: Jason Wang @ 2016-03-25  2:34 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel
  Cc: pbonzini, qemu-devel, peterx

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace(qemu) implementation of
iommu for a secure DMA environment (DMAR) in guest.

The idea is simple. When vhost meets an IOTLB miss, it will request
the assistance of userspace to do the translation, this is done
through:

- Fill the translation request in a preset userspace address (This
  address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
- Notify userspace through eventfd (This eventfd was set through ioctl
  VHOST_SET_IOTLB_FD).

When userspace finishes the translation, it will update the vhost
IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
snooping the IOTLB invalidation of IOMMU IOTLB and use
VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.

The codes were designed to be architecture independent. It should be
easily ported to any architecture.

Changes from V1:
- support any size/range of updating and invalidation through
  introducing the interval tree.
- convert from per device iotlb request to per virtqueue iotlb
  request, this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Thanks

Jason Wang (2):
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c        |  14 +-
 drivers/vhost/vhost.c      | 427 +++++++++++++++++++++++++++++++++++----------
 drivers/vhost/vhost.h      |  42 ++++-
 fs/eventfd.c               |   3 +-
 include/uapi/linux/vhost.h |  35 ++++
 5 files changed, 419 insertions(+), 102 deletions(-)

-- 
2.5.0

^ permalink raw reply

* [RFC PATCH V2 1/2] vhost: convert pre sorted vhost memory array to interval tree
From: Jason Wang @ 2016-03-25  2:34 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel
  Cc: pbonzini, qemu-devel, peterx
In-Reply-To: <1458873274-13961-1-git-send-email-jasowang@redhat.com>

Current pre-sorted memory region array has some limitations for future
device IOTLB conversion:

1) need extra work for adding and removing a single region, and it's
   expected to be slow because of sorting or memory re-allocation.
2) need extra work of removing a large range which may intersect
   several regions with different size.
3) need trick for a replacement policy like LRU

To overcome the above shortcomings, this patch convert it to interval
tree which can easily address the above issue with almost no extra
work.

The patch could be used for:

- Extend the current API and only let the userspace to send diffs of
  memory table.
- Simplify Device IOTLB implementation.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   |   8 +--
 drivers/vhost/vhost.c | 182 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.h |  27 ++++++--
 3 files changed, 128 insertions(+), 89 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..481db96 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -968,20 +968,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	struct socket *tx_sock = NULL;
 	struct socket *rx_sock = NULL;
 	long err;
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
-	memory = vhost_dev_reset_owner_prepare();
-	if (!memory) {
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
 		err = -ENOMEM;
 		goto done;
 	}
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
-	vhost_dev_reset_owner(&n->dev, memory);
+	vhost_dev_reset_owner(&n->dev, umem);
 	vhost_net_vq_reset(n);
 done:
 	mutex_unlock(&n->dev.mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ad2146a..32c35a9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,6 +27,7 @@
 #include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
+#include <linux/interval_tree_generic.h>
 
 #include "vhost.h"
 
@@ -42,6 +43,10 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
+INTERVAL_TREE_DEFINE(struct vhost_umem_node,
+		     rb, __u64, __subtree_last,
+		     START, LAST, , vhost_umem_interval_tree);
+
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
 {
@@ -275,7 +280,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
-	vq->memory = NULL;
+	vq->umem = NULL;
 	vq->is_le = virtio_legacy_is_little_endian();
 	vhost_vq_reset_user_be(vq);
 }
@@ -381,7 +386,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	mutex_init(&dev->mutex);
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
-	dev->memory = NULL;
+	dev->umem = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
@@ -486,27 +491,36 @@ err_mm:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-struct vhost_memory *vhost_dev_reset_owner_prepare(void)
+static void *vhost_kvzalloc(unsigned long size)
+{
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+	if (!n)
+		n = vzalloc(size);
+	return n;
+}
+
+struct vhost_umem *vhost_dev_reset_owner_prepare(void)
 {
-	return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
+	return vhost_kvzalloc(sizeof(struct vhost_umem));
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
 {
 	int i;
 
 	vhost_dev_cleanup(dev, true);
 
 	/* Restore memory to default empty mapping. */
-	memory->nregions = 0;
-	dev->memory = memory;
+	INIT_LIST_HEAD(&umem->umem_list);
+	dev->umem = umem;
 	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
 	 * VQs aren't running.
 	 */
 	for (i = 0; i < dev->nvqs; ++i)
-		dev->vqs[i]->memory = memory;
+		dev->vqs[i]->umem = umem;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
 
@@ -523,6 +537,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_clean(struct vhost_umem *umem)
+{
+	struct vhost_umem_node *node, *tmp;
+
+	if (!umem)
+		return;
+
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
+		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+		list_del(&node->link);
+		kvfree(node);
+	}
+	kvfree(umem);
+}
+
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -549,8 +578,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 		fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kvfree(dev->memory);
-	dev->memory = NULL;
+	vhost_umem_clean(dev->umem);
+	dev->umem = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -576,25 +605,25 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
+static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
 			       int log_all)
 {
-	int i;
+	struct vhost_umem_node *node;
 
-	if (!mem)
+	if (!umem)
 		return 0;
 
-	for (i = 0; i < mem->nregions; ++i) {
-		struct vhost_memory_region *m = mem->regions + i;
-		unsigned long a = m->userspace_addr;
-		if (m->memory_size > ULONG_MAX)
+	list_for_each_entry(node, &umem->umem_list, link) {
+		unsigned long a = node->userspace_addr;
+
+		if (node->size > ULONG_MAX)
 			return 0;
 		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
-				    m->memory_size))
+				    node->size))
 			return 0;
 		else if (log_all && !log_access_ok(log_base,
-						   m->guest_phys_addr,
-						   m->memory_size))
+						   node->start,
+						   node->size))
 			return 0;
 	}
 	return 1;
@@ -602,7 +631,7 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
+static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 			    int log_all)
 {
 	int i;
@@ -615,7 +644,8 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 		log = log_all || vhost_has_feature(d->vqs[i], VHOST_F_LOG_ALL);
 		/* If ring is inactive, will check when it's enabled. */
 		if (d->vqs[i]->private_data)
-			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem, log);
+			ok = vq_memory_access_ok(d->vqs[i]->log_base,
+						 umem, log);
 		else
 			ok = 1;
 		mutex_unlock(&d->vqs[i]->mutex);
@@ -642,7 +672,7 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 /* Caller should have device mutex but not vq mutex */
 int vhost_log_access_ok(struct vhost_dev *dev)
 {
-	return memory_access_ok(dev, dev->memory, 1);
+	return memory_access_ok(dev, dev->umem, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_log_access_ok);
 
@@ -653,7 +683,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
-	return vq_memory_access_ok(log_base, vq->memory,
+	return vq_memory_access_ok(log_base, vq->umem,
 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
@@ -669,28 +699,12 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
-static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
-{
-	const struct vhost_memory_region *r1 = p1, *r2 = p2;
-	if (r1->guest_phys_addr < r2->guest_phys_addr)
-		return 1;
-	if (r1->guest_phys_addr > r2->guest_phys_addr)
-		return -1;
-	return 0;
-}
-
-static void *vhost_kvzalloc(unsigned long size)
-{
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-
-	if (!n)
-		n = vzalloc(size);
-	return n;
-}
-
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
-	struct vhost_memory mem, *newmem, *oldmem;
+	struct vhost_memory mem, *newmem;
+	struct vhost_memory_region *region;
+	struct vhost_umem_node *node;
+	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
 
@@ -710,24 +724,52 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		kvfree(newmem);
 		return -EFAULT;
 	}
-	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
-		vhost_memory_reg_sort_cmp, NULL);
 
-	if (!memory_access_ok(d, newmem, 0)) {
+	newumem = vhost_kvzalloc(sizeof(*newumem));
+	if (!newumem) {
 		kvfree(newmem);
-		return -EFAULT;
+		return -ENOMEM;
+	}
+
+	newumem->umem_tree = RB_ROOT;
+	INIT_LIST_HEAD(&newumem->umem_list);
+
+	for (region = newmem->regions;
+	     region < newmem->regions + mem.nregions;
+	     region++) {
+		node = vhost_kvzalloc(sizeof(*node));
+		if (!node)
+			goto err;
+		node->start = region->guest_phys_addr;
+		node->size = region->memory_size;
+		node->last = node->start + node->size - 1;
+		node->userspace_addr = region->userspace_addr;
+		INIT_LIST_HEAD(&node->link);
+		list_add_tail(&node->link, &newumem->umem_list);
+		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
 	}
-	oldmem = d->memory;
-	d->memory = newmem;
+
+	if (!memory_access_ok(d, newumem, 0))
+		goto err;
+
+	oldumem = d->umem;
+	d->umem = newumem;
 
 	/* All memory accesses are done under some VQ mutex. */
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
-		d->vqs[i]->memory = newmem;
+		d->vqs[i]->umem = newumem;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
-	kvfree(oldmem);
+
+	kvfree(newmem);
+	vhost_umem_clean(oldumem);
 	return 0;
+
+err:
+	vhost_umem_clean(newumem);
+	kvfree(newmem);
+	return -EFAULT;
 }
 
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
@@ -1017,28 +1059,6 @@ done:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
-static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
-{
-	const struct vhost_memory_region *reg;
-	int start = 0, end = mem->nregions;
-
-	while (start < end) {
-		int slot = start + (end - start) / 2;
-		reg = mem->regions + slot;
-		if (addr >= reg->guest_phys_addr)
-			end = slot;
-		else
-			start = slot + 1;
-	}
-
-	reg = mem->regions + start;
-	if (addr >= reg->guest_phys_addr &&
-		reg->guest_phys_addr + reg->memory_size > addr)
-		return reg;
-	return NULL;
-}
-
 /* TODO: This is really inefficient.  We need something like get_user()
  * (instruction directly accesses the data, with an exception table entry
  * returning -EFAULT). See Documentation/x86/exception-tables.txt.
@@ -1180,29 +1200,29 @@ EXPORT_SYMBOL_GPL(vhost_init_used);
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			  struct iovec iov[], int iov_size)
 {
-	const struct vhost_memory_region *reg;
-	struct vhost_memory *mem;
+	const struct vhost_umem_node *node;
+	struct vhost_umem *umem = vq->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
 
-	mem = vq->memory;
 	while ((u64)len > s) {
 		u64 size;
 		if (unlikely(ret >= iov_size)) {
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
-		if (unlikely(!reg)) {
+		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							addr, addr + len - 1);
+		if (node == NULL || node->start > addr) {
 			ret = -EFAULT;
 			break;
 		}
 		_iov = iov + ret;
-		size = reg->memory_size - addr + reg->guest_phys_addr;
+		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(reg->userspace_addr + addr - reg->guest_phys_addr);
+			(node->userspace_addr + addr - node->start);
 		s += size;
 		addr += size;
 		++ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..5d64393 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -52,6 +52,25 @@ struct vhost_log {
 	u64 len;
 };
 
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->last)
+
+struct vhost_umem_node {
+	struct rb_node rb;
+	struct list_head link;
+	__u64 start;
+	__u64 last;
+	__u64 size;
+	__u64 userspace_addr;
+	__u64 flags_padding;
+	__u64 __subtree_last;
+};
+
+struct vhost_umem {
+	struct rb_root umem_tree;
+	struct list_head umem_list;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -100,7 +119,7 @@ struct vhost_virtqueue {
 	struct iovec *indirect;
 	struct vring_used_elem *heads;
 	/* Protected by virtqueue mutex. */
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 	void *private_data;
 	u64 acked_features;
 	/* Log write descriptors */
@@ -117,7 +136,6 @@ struct vhost_virtqueue {
 };
 
 struct vhost_dev {
-	struct vhost_memory *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
 	struct vhost_virtqueue **vqs;
@@ -127,14 +145,15 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct vhost_umem *umem;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
-struct vhost_memory *vhost_dev_reset_owner_prepare(void);
-void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_memory *);
+struct vhost_umem *vhost_dev_reset_owner_prepare(void);
+void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
-- 
2.5.0

^ permalink raw reply related

* [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Jason Wang @ 2016-03-25  2:34 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel
  Cc: pbonzini, qemu-devel, peterx
In-Reply-To: <1458873274-13961-1-git-send-email-jasowang@redhat.com>

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace(qemu) implementation of DMA
remapping.

The idea is simple. When vhost meets an IOTLB miss, it will request
the assistance of userspace to do the translation, this is done
through:

- Fill the translation request in a preset userspace address (This
  address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
- Notify userspace through eventfd (This eventfd was set through ioctl
  VHOST_SET_IOTLB_FD).
- device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl

When userspace finishes the translation, it will update the vhost
IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
snooping the IOTLB invalidation of IOMMU IOTLB and use
VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c        |   6 +-
 drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h      |  17 ++-
 fs/eventfd.c               |   3 +-
 include/uapi/linux/vhost.h |  35 ++++++
 5 files changed, 320 insertions(+), 42 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 481db96..7cbdeed 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -334,7 +334,7 @@ static void handle_tx(struct vhost_net *net)
 		head = vhost_get_vq_desc(vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
-					 NULL, NULL);
+					 NULL, NULL, VHOST_ACCESS_RO);
 		/* On error, stop handling until the next kick. */
 		if (unlikely(head < 0))
 			break;
@@ -470,7 +470,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		}
 		r = vhost_get_vq_desc(vq, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
-				      &in, log, log_num);
+				      &in, log, log_num, VHOST_ACCESS_WO);
 		if (unlikely(r < 0))
 			goto err;
 
@@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
 		if (r == -ENOIOCTLCMD)
 			r = vhost_vring_ioctl(&n->dev, ioctl, argp);
-		else
+		else if (ioctl != VHOST_UPDATE_IOTLB)
 			vhost_net_flush(n);
 		mutex_unlock(&n->dev.mutex);
 		return r;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 32c35a9..1dd43e8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -280,6 +280,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	vq->iotlb_call = NULL;
+	vq->iotlb_call_ctx = NULL;
+	vq->iotlb_request = NULL;
+	vq->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
 	vq->umem = NULL;
 	vq->is_le = virtio_legacy_is_little_endian();
 	vhost_vq_reset_user_be(vq);
@@ -387,8 +391,10 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
 	dev->umem = NULL;
+	dev->iotlb = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
+	spin_lock_init(&dev->iotlb_lock);
 	INIT_LIST_HEAD(&dev->work_list);
 	dev->worker = NULL;
 
@@ -537,6 +543,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_free(struct vhost_umem *umem,
+			    struct vhost_umem_node *node)
+{
+	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+	list_del(&node->link);
+	kfree(node);
+	umem->numem--;
+}
+
 static void vhost_umem_clean(struct vhost_umem *umem)
 {
 	struct vhost_umem_node *node, *tmp;
@@ -544,11 +559,9 @@ static void vhost_umem_clean(struct vhost_umem *umem)
 	if (!umem)
 		return;
 
-	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
-		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
-		list_del(&node->link);
-		kvfree(node);
-	}
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
+		vhost_umem_free(umem, node);
+
 	kvfree(umem);
 }
 
@@ -580,6 +593,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	/* No one will access memory at this point */
 	vhost_umem_clean(dev->umem);
 	dev->umem = NULL;
+	vhost_umem_clean(dev->iotlb);
+	dev->iotlb = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -699,11 +714,61 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static int vhost_new_umem_range(struct vhost_umem *umem,
+				u64 start, u64 size, u64 end,
+				u64 userspace_addr, int perm)
+{
+	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
+
+	if (!node)
+		return -ENOMEM;
+
+	if (umem->numem == VHOST_IOTLB_SIZE) {
+		tmp = list_last_entry(&umem->umem_list, typeof(*tmp), link);
+		vhost_umem_free(umem, tmp);
+	}
+
+	node->start = start;
+	node->size = size;
+	node->last = end;
+	node->userspace_addr = userspace_addr;
+	node->perm = perm;
+	INIT_LIST_HEAD(&node->link);
+	list_add_tail(&node->link, &umem->umem_list);
+	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
+	umem->numem++;
+
+	return 0;
+}
+
+static void vhost_del_umem_range(struct vhost_umem *umem,
+				 u64 start, u64 end)
+{
+	struct vhost_umem_node *node;
+
+	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							   start, end)))
+		vhost_umem_free(umem, node);
+}
+
+static struct vhost_umem *vhost_umem_alloc(void)
+{
+	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
+
+	if (!umem)
+		return NULL;
+
+	umem->umem_tree = RB_ROOT;
+	umem->numem = 0;
+	INIT_LIST_HEAD(&umem->umem_list);
+
+	return umem;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem;
 	struct vhost_memory_region *region;
-	struct vhost_umem_node *node;
 	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
@@ -725,28 +790,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EFAULT;
 	}
 
-	newumem = vhost_kvzalloc(sizeof(*newumem));
+	newumem = vhost_umem_alloc();
 	if (!newumem) {
 		kvfree(newmem);
 		return -ENOMEM;
 	}
 
-	newumem->umem_tree = RB_ROOT;
-	INIT_LIST_HEAD(&newumem->umem_list);
-
 	for (region = newmem->regions;
 	     region < newmem->regions + mem.nregions;
 	     region++) {
-		node = vhost_kvzalloc(sizeof(*node));
-		if (!node)
+		if (vhost_new_umem_range(newumem,
+					 region->guest_phys_addr,
+					 region->memory_size,
+					 region->guest_phys_addr +
+					 region->memory_size - 1,
+					 region->userspace_addr,
+				         VHOST_ACCESS_RW))
 			goto err;
-		node->start = region->guest_phys_addr;
-		node->size = region->memory_size;
-		node->last = node->start + node->size - 1;
-		node->userspace_addr = region->userspace_addr;
-		INIT_LIST_HEAD(&node->link);
-		list_add_tail(&node->link, &newumem->umem_list);
-		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
 	}
 
 	if (!memory_access_ok(d, newumem, 0))
@@ -782,6 +842,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
 	struct vhost_vring_state s;
 	struct vhost_vring_file f;
 	struct vhost_vring_addr a;
+	struct vhost_vring_iotlb_entry e;
 	u32 idx;
 	long r;
 
@@ -910,6 +971,35 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
 		} else
 			filep = eventfp;
 		break;
+	case VHOST_SET_VRING_IOTLB_REQUEST:
+		r = -EFAULT;
+		if (copy_from_user(&e, argp, sizeof e))
+			break;
+		if (!access_ok(VERIFY_WRITE, e.userspace_addr,
+				sizeof(*vq->iotlb_request)))
+			break;
+		r = 0;
+		vq->iotlb_request = (struct vhost_iotlb_entry __user *)e.userspace_addr;
+		break;
+	case VHOST_SET_VRING_IOTLB_CALL:
+		if (copy_from_user(&f, argp, sizeof f)) {
+			r = -EFAULT;
+			break;
+		}
+		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
+		if (eventfp != vq->iotlb_call) {
+			filep = vq->iotlb_call;
+			ctx = vq->iotlb_call_ctx;
+			vq->iotlb_call = eventfp;
+			vq->iotlb_call_ctx = eventfp ?
+				eventfd_ctx_fileget(eventfp) : NULL;
+		} else
+			filep = eventfp;
+		break;
 	case VHOST_SET_VRING_CALL:
 		if (copy_from_user(&f, argp, sizeof f)) {
 			r = -EFAULT;
@@ -977,11 +1067,55 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
 }
 EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
 
+static int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
+{
+	struct vhost_umem *niotlb, *oiotlb;
+
+	if (enabled) {
+		niotlb = vhost_umem_alloc();
+		if (!niotlb)
+			return -ENOMEM;
+	} else
+		niotlb = NULL;
+
+	spin_lock(&d->iotlb_lock);
+	oiotlb = d->iotlb;
+	d->iotlb = niotlb;
+	spin_unlock(&d->iotlb_lock);
+
+	vhost_umem_clean(oiotlb);
+
+	return 0;
+}
+
+static void vhost_complete_iotlb_update(struct vhost_dev *d,
+					struct vhost_iotlb_entry *entry)
+{
+	struct vhost_iotlb_entry *req;
+	struct vhost_virtqueue *vq;
+	int i;
+
+
+	for (i = 0; i < d->nvqs; i++) {
+		vq = d->vqs[i];
+		mutex_lock(&vq->mutex);
+		req = &vq->pending_request;
+		if (entry->iova <= req->iova &&
+		    entry->iova + entry->size - 1 > req->iova &&
+		    req->flags.type == VHOST_IOTLB_MISS) {
+			*req = *entry;
+			vhost_poll_queue(&vq->poll);
+		}
+		mutex_unlock(&vq->mutex);
+	}
+}
+
 /* Caller must have device mutex */
 long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
 	struct file *eventfp, *filep = NULL;
 	struct eventfd_ctx *ctx = NULL;
+	struct vhost_iotlb_entry entry;
 	u64 p;
 	long r;
 	int i, fd;
@@ -1050,6 +1184,52 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 		if (filep)
 			fput(filep);
 		break;
+	case VHOST_RUN_IOTLB:
+		/* FIXME: enable and disabled */
+		vhost_init_device_iotlb(d, true);
+		break;
+	case VHOST_UPDATE_IOTLB:
+		r = copy_from_user(&entry, argp, sizeof(entry));
+		if (r < 0) {
+			r = -EFAULT;
+			goto done;
+		}
+
+		spin_lock(&d->iotlb_lock);
+		if (!d->iotlb) {
+			spin_unlock(&d->iotlb_lock);
+			r = -EFAULT;
+			goto done;
+		}
+		switch (entry.flags.type) {
+		case VHOST_IOTLB_UPDATE:
+			if (entry.flags.valid != VHOST_IOTLB_VALID) {
+				break;
+			}
+			if (vhost_new_umem_range(d->iotlb,
+						 entry.iova,
+						 entry.size,
+						 entry.iova + entry.size - 1,
+                                                 entry.userspace_addr,
+                                                 entry.flags.perm)) {
+				r = -ENOMEM;
+				break;
+			}
+			break;
+		case VHOST_IOTLB_INVALIDATE:
+			vhost_del_umem_range(d->iotlb,
+					     entry.iova,
+					     entry.iova + entry.size - 1);
+			break;
+		default:
+			r = -EINVAL;
+		}
+		spin_unlock(&d->iotlb_lock);
+
+		if (!r && entry.flags.type != VHOST_IOTLB_INVALIDATE)
+			vhost_complete_iotlb_update(d, &entry);
+
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 		break;
@@ -1197,27 +1377,69 @@ int vhost_init_used(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_init_used);
 
+static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova)
+{
+	struct vhost_iotlb_entry *pending = &vq->pending_request;
+	int ret;
+
+	if (!vq->iotlb_call_ctx)
+		return -EOPNOTSUPP;
+
+	if (!vq->iotlb_request)
+		return -EOPNOTSUPP;
+
+	if (pending->flags.type == VHOST_IOTLB_MISS) {
+		return -EEXIST;
+	}
+
+	pending->iova = iova;
+	pending->flags.type = VHOST_IOTLB_MISS;
+
+	ret = __copy_to_user(vq->iotlb_request, pending,
+			     sizeof(struct vhost_iotlb_entry));
+	if (ret) {
+		goto err;
+	}
+
+	if (vq->iotlb_call_ctx)
+		eventfd_signal(vq->iotlb_call_ctx, 1);
+err:
+	return ret;
+}
+
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
-			  struct iovec iov[], int iov_size)
+			  struct iovec iov[], int iov_size, int access)
 {
 	const struct vhost_umem_node *node;
-	struct vhost_umem *umem = vq->umem;
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
 
+	spin_lock(&dev->iotlb_lock);
+
 	while ((u64)len > s) {
 		u64 size;
 		if (unlikely(ret >= iov_size)) {
 			ret = -ENOBUFS;
 			break;
 		}
+
 		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
 							addr, addr + len - 1);
 		if (node == NULL || node->start > addr) {
-			ret = -EFAULT;
+			if (umem != dev->iotlb) {
+				ret = -EFAULT;
+				break;
+			}
+			ret = -EAGAIN;
+			break;
+		} else if (!(node->perm & access)) {
+			ret = -EPERM;
 			break;
 		}
+
 		_iov = iov + ret;
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
@@ -1228,6 +1450,10 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		++ret;
 	}
 
+	spin_unlock(&dev->iotlb_lock);
+
+	if (ret == -EAGAIN)
+		vhost_iotlb_miss(vq, addr);
 	return ret;
 }
 
@@ -1256,7 +1482,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			struct iovec iov[], unsigned int iov_size,
 			unsigned int *out_num, unsigned int *in_num,
 			struct vhost_log *log, unsigned int *log_num,
-			struct vring_desc *indirect)
+			struct vring_desc *indirect, int access)
 {
 	struct vring_desc desc;
 	unsigned int i = 0, count, found = 0;
@@ -1274,9 +1500,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
 	}
 
 	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
-			     UIO_MAXIOV);
+			     UIO_MAXIOV, access);
 	if (unlikely(ret < 0)) {
-		vq_err(vq, "Translation failure %d in indirect.\n", ret);
+		if (ret != -EAGAIN)
+			vq_err(vq, "Translation failure %d in indirect.\n", ret);
 		return ret;
 	}
 	iov_iter_init(&from, READ, vq->indirect, ret, len);
@@ -1316,10 +1543,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d indirect idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d indirect idx %d\n",
+					ret, i);
 			return ret;
 		}
 		/* If this is an input descriptor, increment that count. */
@@ -1355,7 +1583,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
 int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access)
 {
 	struct vring_desc desc;
 	unsigned int i, head, found = 0;
@@ -1433,10 +1662,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
-					   log, log_num, &desc);
+					   log, log_num, &desc, access);
 			if (unlikely(ret < 0)) {
-				vq_err(vq, "Failure detected "
-				       "in indirect descriptor at idx %d\n", i);
+				if (ret != -EAGAIN)
+					vq_err(vq, "Failure detected "
+						"in indirect descriptor at idx %d\n", i);
 				return ret;
 			}
 			continue;
@@ -1444,10 +1674,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d descriptor idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d descriptor idx %d\n",
+					ret, i);
 			return ret;
 		}
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 5d64393..4365104 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -62,13 +62,15 @@ struct vhost_umem_node {
 	__u64 last;
 	__u64 size;
 	__u64 userspace_addr;
-	__u64 flags_padding;
+	__u32 perm;
+	__u32 flags_padding;
 	__u64 __subtree_last;
 };
 
 struct vhost_umem {
 	struct rb_root umem_tree;
 	struct list_head umem_list;
+	int numem;
 };
 
 /* The virtqueue structure describes a queue attached to a device. */
@@ -84,9 +86,13 @@ struct vhost_virtqueue {
 	struct file *kick;
 	struct file *call;
 	struct file *error;
+	struct file *iotlb_call;
 	struct eventfd_ctx *call_ctx;
 	struct eventfd_ctx *error_ctx;
 	struct eventfd_ctx *log_ctx;
+	struct eventfd_ctx *iotlb_call_ctx;
+	struct vhost_iotlb_entry __user *iotlb_request;
+	struct vhost_iotlb_entry pending_request;
 
 	struct vhost_poll poll;
 
@@ -135,6 +141,8 @@ struct vhost_virtqueue {
 #endif
 };
 
+#define VHOST_IOTLB_SIZE 2048
+
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
@@ -146,6 +154,8 @@ struct vhost_dev {
 	struct list_head work_list;
 	struct task_struct *worker;
 	struct vhost_umem *umem;
+	struct vhost_umem *iotlb;
+	spinlock_t iotlb_lock;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
@@ -164,7 +174,8 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num);
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_init_used(struct vhost_virtqueue *);
@@ -183,7 +194,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
 
 #define vq_err(vq, fmt, ...) do {                                  \
-		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
+		printk(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \
 				eventfd_signal((vq)->error_ctx, 1);\
 	} while (0)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..5c0a22f 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -59,8 +59,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	if (ULLONG_MAX - ctx->count < n)
 		n = ULLONG_MAX - ctx->count;
 	ctx->count += n;
-	if (waitqueue_active(&ctx->wqh))
+	if (waitqueue_active(&ctx->wqh)) {
 		wake_up_locked_poll(&ctx->wqh, POLLIN);
+	}
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return n;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index ab373191..5c35ab4 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -47,6 +47,32 @@ struct vhost_vring_addr {
 	__u64 log_guest_addr;
 };
 
+struct vhost_iotlb_entry {
+	__u64 iova;
+	__u64 size;
+	__u64 userspace_addr;
+	struct {
+#define VHOST_ACCESS_RO      0x1
+#define VHOST_ACCESS_WO      0x2
+#define VHOST_ACCESS_RW      0x3
+		__u8  perm;
+#define VHOST_IOTLB_MISS           1
+#define VHOST_IOTLB_UPDATE         2
+#define VHOST_IOTLB_INVALIDATE     3
+		__u8  type;
+#define VHOST_IOTLB_INVALID        0x1
+#define VHOST_IOTLB_VALID          0x2
+		__u8  valid;
+		__u8  u8_padding;
+		__u32 padding;
+	} flags;
+};
+
+struct vhost_vring_iotlb_entry {
+	unsigned int index;
+	__u64 userspace_addr;
+};
+
 struct vhost_memory_region {
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
@@ -127,6 +153,15 @@ struct vhost_memory {
 /* Set eventfd to signal an error */
 #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
 
+/* IOTLB */
+/* Specify an eventfd file descriptor to signle on IOTLB miss */
+#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
+                                        vhost_vring_file)
+#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
+                                           vhost_vring_iotlb_entry)
+#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
+#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
+
 /* VHOST_NET specific defines */
 
 /* Attach virtio net ring to a raw socket, or tap device.
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH v2 00/18] Support non-lru page migration
From: Minchan Kim @ 2016-03-28  5:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1458541867-27380-1-git-send-email-minchan@kernel.org>

Hello Andrew,

On Mon, Mar 21, 2016 at 03:30:49PM +0900, Minchan Kim wrote:
> Recently, I got many reports about perfermance degradation
> in embedded system(Android mobile phone, webOS TV and so on)
> and failed to fork easily.
> 
> The problem was fragmentation caused by zram and GPU driver
> pages. Their pages cannot be migrated so compaction cannot
> work well, either so reclaimer ends up shrinking all of working
> set pages. It made system very slow and even to fail to fork
> easily.
> 
> Other pain point is that they cannot work with CMA.
> Most of CMA memory space could be idle(ie, it could be used
> for movable pages unless driver is using) but if driver(i.e.,
> zram) cannot migrate his page, that memory space could be
> wasted. In our product which has big CMA memory, it reclaims
> zones too exccessively although there are lots of free space
> in CMA so system was very slow easily.
> 
> To solve these problem, this patch try to add facility to
> migrate non-lru pages via introducing new friend functions
> of migratepage in address_space_operation and new page flags.
> 
> 	(isolate_page, putback_page)
> 	(PG_movable, PG_isolated)
> 
> For details, please read description in
> "mm/compaction: support non-lru movable page migration".
> 
> Originally, Gioh Kim tried to support this feature but he moved
> so I took over the work. But I took many code from his work and
> changed a little bit.
> Thanks, Gioh!
> 
> And I should mention Konstantin Khlebnikov. He really heped Gioh
> at that time so he should deserve to have many credit, too.
> Thanks, Konstantin!
> 
> This patchset consists of five parts
> 
> 1. clean up migration
>   mm: use put_page to free page instead of putback_lru_page
> 
> 2. zsmalloc clean-up for preparing page migration
>   zsmalloc: use first_page rather than page
>   zsmalloc: clean up many BUG_ON
>   zsmalloc: reordering function parameter
>   zsmalloc: remove unused pool param in obj_free
>   zsmalloc: keep max_object in size_class
>   zsmalloc: squeeze inuse into page->mapping
>   zsmalloc: squeeze freelist into page->mapping
>   zsmalloc: move struct zs_meta from mapping to freelist
>   zsmalloc: factor page chain functionality out
>   zsmalloc: separate free_zspage from putback_zspage
>   zsmalloc: zs_compact refactoring

In this series, [2-5] are clean up regardless of goal of the patchset
so it could be merged independently.
I want to reduce patchset size in next post.
If anyone are not against, could you merge cleanup patchset?

   zsmalloc: use first_page rather than page
   zsmalloc: clean up many BUG_ON
   zsmalloc: reordering function parameter
   zsmalloc: remove unused pool param in obj_free

Thanks.

^ permalink raw reply

* [PATCH 02/10] x86/cpufeature: Kill cpu_has_hypervisor
From: Borislav Petkov @ 2016-03-29 15:41 UTC (permalink / raw)
  To: X86 ML; +Cc: sparmaintainer, LKML, virtualization
In-Reply-To: <1459266123-21878-1-git-send-email-bp@alien8.de>

From: Borislav Petkov <bp@suse.de>

Use boot_cpu_has() instead.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: virtualization@lists.linux-foundation.org
Cc: sparmaintainer@unisys.com
---
 arch/x86/events/intel/cstate.c                 | 2 +-
 arch/x86/events/intel/uncore.c                 | 2 +-
 arch/x86/include/asm/cpufeature.h              | 1 -
 arch/x86/kernel/cpu/vmware.c                   | 2 +-
 arch/x86/kernel/kvm.c                          | 2 +-
 drivers/staging/unisys/visorbus/visorchipset.c | 2 +-
 6 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 7946c4231169..d5045c8e2e63 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -677,7 +677,7 @@ static int __init cstate_pmu_init(void)
 {
 	int err;
 
-	if (cpu_has_hypervisor)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return -ENODEV;
 
 	err = cstate_init();
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index 7012d18bb293..3f6d8b5672d5 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -1383,7 +1383,7 @@ static int __init intel_uncore_init(void)
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return -ENODEV;
 
-	if (cpu_has_hypervisor)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return -ENODEV;
 
 	max_packages = topology_max_packages();
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index fee7a6efcd2d..3aea54ecabfd 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -136,7 +136,6 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 #define cpu_has_xsave		boot_cpu_has(X86_FEATURE_XSAVE)
 #define cpu_has_xsaves		boot_cpu_has(X86_FEATURE_XSAVES)
 #define cpu_has_osxsave		boot_cpu_has(X86_FEATURE_OSXSAVE)
-#define cpu_has_hypervisor	boot_cpu_has(X86_FEATURE_HYPERVISOR)
 /*
  * Do not add any more of those clumsy macros - use static_cpu_has() for
  * fast paths and boot_cpu_has() otherwise!
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 364e58346897..8cac429b6a1d 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -94,7 +94,7 @@ static void __init vmware_platform_setup(void)
  */
 static uint32_t __init vmware_platform(void)
 {
-	if (cpu_has_hypervisor) {
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
 		unsigned int eax;
 		unsigned int hyper_vendor_id[3];
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 807950860fb7..dc1207e2f193 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -522,7 +522,7 @@ static noinline uint32_t __kvm_cpuid_base(void)
 	if (boot_cpu_data.cpuid_level < 0)
 		return 0;	/* So we don't blow up on old processors */
 
-	if (cpu_has_hypervisor)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);
 
 	return 0;
diff --git a/drivers/staging/unisys/visorbus/visorchipset.c b/drivers/staging/unisys/visorbus/visorchipset.c
index 5fbda7b218c7..9cf4f8463c4e 100644
--- a/drivers/staging/unisys/visorbus/visorchipset.c
+++ b/drivers/staging/unisys/visorbus/visorchipset.c
@@ -2425,7 +2425,7 @@ static __init uint32_t visorutil_spar_detect(void)
 {
 	unsigned int eax, ebx, ecx, edx;
 
-	if (cpu_has_hypervisor) {
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
 		/* check the ID */
 		cpuid(UNISYS_SPAR_LEAF_ID, &eax, &ebx, &ecx, &edx);
 		return  (ebx == UNISYS_SPAR_ID_EBX) &&
-- 
2.7.3

^ permalink raw reply related

* RE: [PATCH 02/10] x86/cpufeature: Kill cpu_has_hypervisor
From: Kershner, David A @ 2016-03-29 18:03 UTC (permalink / raw)
  To: Borislav Petkov, X86 ML
  Cc: *S-Par-Maintainer, LKML,
	virtualization@lists.linux-foundation.org
In-Reply-To: <1459266123-21878-3-git-send-email-bp@alien8.de>


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Tuesday, March 29, 2016 11:42 AM
> To: X86 ML <x86@kernel.org>
> Cc: LKML <linux-kernel@vger.kernel.org>; virtualization@lists.linux-
> foundation.org; *S-Par-Maintainer <SParMaintainer@unisys.com>
> Subject: [PATCH 02/10] x86/cpufeature: Kill cpu_has_hypervisor
> 
> From: Borislav Petkov <bp@suse.de>
> 
> Use boot_cpu_has() instead.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>

Tested-by: David Kershner <david.kershner@unisys.com>

> Cc: virtualization@lists.linux-foundation.org
> Cc: sparmaintainer@unisys.com
> ---
>  arch/x86/events/intel/cstate.c                 | 2 +-
>  arch/x86/events/intel/uncore.c                 | 2 +-
>  arch/x86/include/asm/cpufeature.h              | 1 -
>  arch/x86/kernel/cpu/vmware.c                   | 2 +-
>  arch/x86/kernel/kvm.c                          | 2 +-
>  drivers/staging/unisys/visorbus/visorchipset.c | 2 +-
>  6 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
> index 7946c4231169..d5045c8e2e63 100644
> --- a/arch/x86/events/intel/cstate.c
> +++ b/arch/x86/events/intel/cstate.c
> @@ -677,7 +677,7 @@ static int __init cstate_pmu_init(void)
>  {
>  	int err;
> 
> -	if (cpu_has_hypervisor)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return -ENODEV;
> 
>  	err = cstate_init();
> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
> index 7012d18bb293..3f6d8b5672d5 100644
> --- a/arch/x86/events/intel/uncore.c
> +++ b/arch/x86/events/intel/uncore.c
> @@ -1383,7 +1383,7 @@ static int __init intel_uncore_init(void)
>  	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>  		return -ENODEV;
> 
> -	if (cpu_has_hypervisor)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return -ENODEV;
> 
>  	max_packages = topology_max_packages();
> diff --git a/arch/x86/include/asm/cpufeature.h
> b/arch/x86/include/asm/cpufeature.h
> index fee7a6efcd2d..3aea54ecabfd 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -136,7 +136,6 @@ extern const char * const
> x86_bug_flags[NBUGINTS*32];
>  #define cpu_has_xsave
> 	boot_cpu_has(X86_FEATURE_XSAVE)
>  #define cpu_has_xsaves
> 	boot_cpu_has(X86_FEATURE_XSAVES)
>  #define cpu_has_osxsave
> 	boot_cpu_has(X86_FEATURE_OSXSAVE)
> -#define cpu_has_hypervisor	boot_cpu_has(X86_FEATURE_HYPERVISOR)
>  /*
>   * Do not add any more of those clumsy macros - use static_cpu_has() for
>   * fast paths and boot_cpu_has() otherwise!
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 364e58346897..8cac429b6a1d 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -94,7 +94,7 @@ static void __init vmware_platform_setup(void)
>   */
>  static uint32_t __init vmware_platform(void)
>  {
> -	if (cpu_has_hypervisor) {
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
>  		unsigned int eax;
>  		unsigned int hyper_vendor_id[3];
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 807950860fb7..dc1207e2f193 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -522,7 +522,7 @@ static noinline uint32_t __kvm_cpuid_base(void)
>  	if (boot_cpu_data.cpuid_level < 0)
>  		return 0;	/* So we don't blow up on old processors */
> 
> -	if (cpu_has_hypervisor)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);
> 
>  	return 0;
> diff --git a/drivers/staging/unisys/visorbus/visorchipset.c
> b/drivers/staging/unisys/visorbus/visorchipset.c
> index 5fbda7b218c7..9cf4f8463c4e 100644
> --- a/drivers/staging/unisys/visorbus/visorchipset.c
> +++ b/drivers/staging/unisys/visorbus/visorchipset.c
> @@ -2425,7 +2425,7 @@ static __init uint32_t visorutil_spar_detect(void)
>  {
>  	unsigned int eax, ebx, ecx, edx;
> 
> -	if (cpu_has_hypervisor) {
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
>  		/* check the ID */
>  		cpuid(UNISYS_SPAR_LEAF_ID, &eax, &ebx, &ecx, &edx);
>  		return  (ebx == UNISYS_SPAR_ID_EBX) &&
> --
> 2.7.3

^ permalink raw reply

* [PATCH v3 00/16] Support non-lru page migration
From: Minchan Kim @ 2016-03-30  7:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Minchan Kim, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman

Recently, I got many reports about perfermance degradation
in embedded system(Android mobile phone, webOS TV and so on)
and failed to fork easily.

The problem was fragmentation caused by zram and GPU driver
pages. Their pages cannot be migrated so compaction cannot
work well, either so reclaimer ends up shrinking all of working
set pages. It made system very slow and even to fail to fork
easily.

Other pain point is that they cannot work with CMA.
Most of CMA memory space could be idle(ie, it could be used
for movable pages unless driver is using) but if driver(i.e.,
zram) cannot migrate his page, that memory space could be
wasted. In our product which has big CMA memory, it reclaims
zones too exccessively although there are lots of free space
in CMA so system was very slow easily.

To solve these problem, this patch try to add facility to
migrate non-lru pages via introducing new friend functions
of migratepage in address_space_operation and new page flags.

	(isolate_page, putback_page)
	(PG_movable, PG_isolated)

For details, please read description in
"mm/compaction: support non-lru movable page migration".

Originally, Gioh Kim tried to support this feature but he moved
so I took over the work. But I took many code from his work and
changed a little bit.
Thanks, Gioh!

And I should mention Konstantin Khlebnikov. He really heped Gioh
at that time so he should deserve to have many credit, too.
Thanks, Konstantin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm/compaction: support non-lru movable page migration
  mm: add non-lru movable page support document

3. rework KVM memory-ballooning
  mm/balloon: use general movable page feature into balloon

4. zsmalloc clean-up for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: squeeze inuse into page->mapping
  zsmalloc: remove page_mapcount_reset
  zsmalloc: squeeze freelist into page->mapping
  zsmalloc: move struct zs_meta from mapping to freelist
  zsmalloc: factor page chain functionality out
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: zs_compact refactoring

5. add zsmalloc page migration
  zsmalloc: migrate head page of zspage
  zsmalloc: use single linked list for page chain
  zsmalloc: migrate tail pages in zspage
  zram: use __GFP_MOVABLE for memory allocation

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use each own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh


Minchan Kim (16):
  mm: use put_page to free page instead of putback_lru_page
  mm/compaction: support non-lru movable page migration
  mm: add non-lru movable page support document
  mm/balloon: use general movable page feature into balloon
  zsmalloc: keep max_object in size_class
  zsmalloc: squeeze inuse into page->mapping
  zsmalloc: remove page_mapcount_reset
  zsmalloc: squeeze freelist into page->mapping
  zsmalloc: move struct zs_meta from mapping to freelist
  zsmalloc: factor page chain functionality out
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: zs_compact refactoring
  zsmalloc: migrate head page of zspage
  zsmalloc: use single linked list for page chain
  zsmalloc: migrate tail pages in zspage
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking      |    4 +
 Documentation/filesystems/vfs.txt      |   16 +-
 Documentation/vm/page_migration        |   69 +-
 drivers/block/zram/zram_drv.c          |    3 +-
 drivers/virtio/virtio_balloon.c        |   53 +-
 fs/proc/page.c                         |    3 +
 include/linux/balloon_compaction.h     |   49 +-
 include/linux/fs.h                     |    2 +
 include/linux/migrate.h                |    2 +
 include/linux/page-flags.h             |   47 +-
 include/uapi/linux/kernel-page-flags.h |    1 +
 include/uapi/linux/magic.h             |    2 +
 mm/balloon_compaction.c                |  101 +--
 mm/compaction.c                        |   15 +-
 mm/migrate.c                           |  238 ++++--
 mm/vmscan.c                            |    2 +-
 mm/zsmalloc.c                          | 1253 ++++++++++++++++++++++++--------
 17 files changed, 1368 insertions(+), 492 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Minchan Kim, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Naoya Horiguchi, Joonsoo Kim, Vlastimil Babka,
	Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

Procedure of page migration is as follows:

First of all, it should isolate a page from LRU and try to
migrate the page. If it is successful, it releases the page
for freeing. Otherwise, it should put the page back to LRU
list.

For LRU pages, we have used putback_lru_page for both freeing
and putback to LRU list. It's okay because put_page is aware of
LRU list so if it releases last refcount of the page, it removes
the page from LRU list. However, It makes unnecessary operations
(e.g., lru_cache_add, pagevec and flags operations. It would be
not significant but no worth to do) and harder to support new
non-lru page migration because put_page isn't aware of non-lru
page's data structure.

To solve the problem, we can add new hook in put_page with
PageMovable flags check but it can increase overhead in
hot path and needs new locking scheme to stabilize the flag check
with put_page.

So, this patch cleans it up to divide two semantic(ie, put and putback).
If migration is successful, use put_page instead of putback_lru_page and
use putback_lru_page only on failure. That makes code more readable
and doesn't add overhead in put_page.

Comment from Vlastimil
"Yeah, and compaction (perhaps also other migration users) has to drain
the lru pvec... Getting rid of this stuff is worth even by itself."

Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/migrate.c | 50 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7b27e0..53529c805752 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -913,6 +913,14 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		put_anon_vma(anon_vma);
 	unlock_page(page);
 out:
+	/* If migration is successful, move newpage to right list */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		if (unlikely(__is_movable_balloon_page(newpage)))
+			put_page(newpage);
+		else
+			putback_lru_page(newpage);
+	}
+
 	return rc;
 }
 
@@ -946,6 +954,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 
 	if (page_count(page) == 1) {
 		/* page was freed from under us. So we are done. */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 		goto out;
 	}
 
@@ -958,10 +972,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_new_page = NULL;
+	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
-	}
 
 out:
 	if (rc != -EAGAIN) {
@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		/* Soft-offlined page shouldn't go through lru cache list */
+	}
+
+	/*
+	 * If migration is successful, drop the reference grabbed during
+	 * isolation. Otherwise, restore the page to LRU list unless we
+	 * want to retry.
+	 */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		put_page(page);
 		if (reason == MR_MEMORY_FAILURE) {
-			put_page(page);
 			if (!test_set_page_hwpoison(page))
 				num_poisoned_pages_inc();
-		} else
+		}
+	} else {
+		if (rc != -EAGAIN)
 			putback_lru_page(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 	}
 
-	/*
-	 * If migration was not successful and there's a freeing callback, use
-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
-	 * during isolation.
-	 */
-	if (put_new_page)
-		put_new_page(newpage, private);
-	else if (unlikely(__is_movable_balloon_page(newpage))) {
-		/* drop our reference, page already in the balloon */
-		put_page(newpage);
-	} else
-		putback_lru_page(newpage);
-
 	if (result) {
 		if (rc)
 			*result = rc;
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aquini, rknize, dri-devel, virtualization, bfields, linux-mm,
	Sangseok Lee, jlayton, koct9i, YiPing Xu, Minchan Kim,
	Hugh Dickins, Gioh Kim, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Al Viro, Chan Gyun Jeong, linux-kernel, Sergey Senozhatsky,
	Gioh Kim, Joonsoo Kim
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

We have allowed migration for only LRU pages until now and it was
enough to make high-order pages. But recently, embedded system(e.g.,
webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
so we have seen several reports about troubles of small high-order
allocation. For fixing the problem, there were several efforts
(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
reserved memory, vmalloc and so on) but if there are lots of
non-movable pages in system, their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages
with movable. For the feature, this patch introduces functions related
to migration to address_space_operations as well as some page flags.

Basically, this patch supports two page-flags and two functions related
to page migration. The flag and page->mapping stability are protected
by PG_lock.

	PG_movable
	PG_isolated

	bool (*isolate_page) (struct page *, isolate_mode_t);
	void (*putback_page) (struct page *);

Duty of subsystem want to make their pages as migratable are
as follows:

1. It should register address_space to page->mapping then mark
the page as PG_movable via __SetPageMovable.

2. It should mark the page as PG_isolated via SetPageIsolated
if isolation is sucessful and return true.

3. If migration is successful, it should clear PG_isolated and
PG_movable of the page for free preparation then release the
reference of the page to free.

4. If migration fails, putback function of subsystem should
clear PG_isolated via ClearPageIsolated.

5. If a subsystem want to release isolated page, it should
clear PG_isolated but not PG_movable. Instead, VM will do it.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: dri-devel@lists.freedesktop.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Gioh Kim <gurugio@hanmail.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking      |   4 +
 Documentation/filesystems/vfs.txt      |   5 +
 fs/proc/page.c                         |   3 +
 include/linux/fs.h                     |   2 +
 include/linux/migrate.h                |   2 +
 include/linux/page-flags.h             |  31 ++++++
 include/uapi/linux/kernel-page-flags.h |   1 +
 mm/compaction.c                        |  14 ++-
 mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
 9 files changed, 217 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..0bb79560abb3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
+	void (*putback_page) (struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
+isolate_page:		yes
 migratepage:		yes (both)
+putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index b02a7d598258..4c1b6c3b4bc8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -592,9 +592,14 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	/* isolate a page for migration */
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
+	/* put the page back to right list */
+	void (*putback_page) (struct page *);
 	int (*launder_page) (struct page *);
+
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 3ecd445e830d..ce3d08a4ad8d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
 	if (page_is_idle(page))
 		u |= 1 << KPF_IDLE;
 
+	if (PageMovable(page))
+		u |= 1 << KPF_MOVABLE;
+
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index da9e67d937e5..36f2d610e7a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -401,6 +401,8 @@ struct address_space_operations {
 	 */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
+	bool (*isolate_page)(struct page *, isolate_mode_t);
+	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9b50325e4ddf..404fbfefeb33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..77ebf8fdbc6e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -129,6 +129,10 @@ enum pageflags {
 
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_private_2,
+
+	/* non-lru movable pages */
+	PG_movable = PG_reclaim,
+	PG_isolated = PG_owner_priv_1,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
 	atomic_set(&page->_mapcount, -1);
 }
 
+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
+
+static inline int PageMovable(struct page *page)
+{
+	return ((test_bit(PG_movable, &(page)->flags) &&
+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
+		|| PageBalloon(page));
+}
+
+/* Caller should hold a PG_lock */
+static inline void __SetPageMovable(struct page *page,
+				struct address_space *mapping)
+{
+	page->mapping = mapping;
+	__set_bit(PG_movable, &page->flags);
+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
+}
+
+static inline void __ClearPageMovable(struct page *page)
+{
+	atomic_set(&page->_mapcount, -1);
+	__clear_bit(PG_movable, &(page)->flags);
+	page->mapping = NULL;
+}
+
+PAGEFLAG(Isolated, isolated, PF_ANY);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 5da5f8751ce7..a184fd2434fa 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -34,6 +34,7 @@
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
 #define KPF_IDLE		25
+#define KPF_MOVABLE		26
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index ccf97b02b85f..7557aedddaee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/*
 		 * Check may be lockless but that's ok as we recheck later.
-		 * It's possible to migrate LRU pages and balloon pages
+		 * It's possible to migrate LRU and movable kernel pages.
 		 * Skip any other type of page
 		 */
 		is_lru = PageLRU(page);
@@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 					goto isolate_success;
 				}
 			}
+
+			if (unlikely(PageMovable(page)) &&
+					!PageIsolated(page)) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+
+				if (isolate_movable_page(page, isolate_mode))
+					goto isolate_success;
+			}
 		}
 
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 53529c805752..b56bf2b3fe8c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -73,6 +73,85 @@ int migrate_prep_local(void)
 	return 0;
 }
 
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+	bool ret = false;
+
+	/*
+	 * Avoid burning cycles with pages that are yet under __free_pages(),
+	 * or just got freed under us.
+	 *
+	 * In case we 'win' a race for a movable page being freed under us and
+	 * raise its refcount preventing __free_pages() from doing its job
+	 * the put_page() at the end of this block will take care of
+	 * release this page, thus avoiding a nasty leakage.
+	 */
+	if (unlikely(!get_page_unless_zero(page)))
+		goto out;
+
+	/*
+	 * Check PG_movable before holding a PG_lock because page's owner
+	 * assumes anybody doesn't touch PG_lock of newly allocated page.
+	 */
+	if (unlikely(!PageMovable(page)))
+		goto out_putpage;
+	/*
+	 * As movable pages are not isolated from LRU lists, concurrent
+	 * compaction threads can race against page migration functions
+	 * as well as race against the releasing a page.
+	 *
+	 * In order to avoid having an already isolated movable page
+	 * being (wrongly) re-isolated while it is under migration,
+	 * or to avoid attempting to isolate pages being released,
+	 * lets be sure we have the page lock
+	 * before proceeding with the movable page isolation steps.
+	 */
+	if (unlikely(!trylock_page(page)))
+		goto out_putpage;
+
+	if (!PageMovable(page) || PageIsolated(page))
+		goto out_no_isolated;
+
+	ret = page->mapping->a_ops->isolate_page(page, mode);
+	if (!ret)
+		goto out_no_isolated;
+
+	WARN_ON_ONCE(!PageIsolated(page));
+	unlock_page(page);
+	return ret;
+
+out_no_isolated:
+	unlock_page(page);
+out_putpage:
+	put_page(page);
+out:
+	return ret;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+	/*
+	 * 'lock_page()' stabilizes the page and prevents races against
+	 * concurrent isolation threads attempting to re-isolate it.
+	 */
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+
+	lock_page(page);
+	if (PageIsolated(page)) {
+		struct address_space *mapping;
+
+		mapping = page_mapping(page);
+		mapping->a_ops->putback_page(page);
+		WARN_ON_ONCE(PageIsolated(page));
+	} else {
+		__ClearPageMovable(page);
+	}
+	unlock_page(page);
+	/* drop the extra ref count taken for movable page isolation */
+	put_page(page);
+}
+
 /*
  * Put previously isolated pages back onto the appropriate lists
  * from where they were once taken off for compaction/migration.
@@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
+		if (unlikely(isolated_balloon_page(page))) {
 			balloon_page_putback(page);
-		else
+		} else if (unlikely(PageMovable(page))) {
+			if (PageIsolated(page)) {
+				putback_movable_page(page);
+			} else {
+				__ClearPageMovable(page);
+				put_page(page);
+			}
+		} else {
 			putback_lru_page(page);
+		}
 	}
 }
 
@@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
  ***********************************************************/
 
 /*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
  * pages that do not use PagePrivate/PagePrivate2.
  *
  * Pages are locked upon entry and exit.
@@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping;
-	int rc;
+	int rc = -EAGAIN;
+	bool lru_movable = true;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	mapping = page_mapping(page);
-	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, mode);
-	else if (mapping->a_ops->migratepage)
-		/*
-		 * Most pages have a mapping and most filesystems provide a
-		 * migratepage callback. Anonymous pages are part of swap
-		 * space which also has its own migratepage callback. This
-		 * is the most common path for page migration.
-		 */
-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page, mode);
+	/*
+	 * In case of non-lru page, it could be released after
+	 * isolation step. In that case, we shouldn't try
+	 * fallback migration which was designed for LRU pages.
+	 *
+	 * The rule for such case is that subsystem should clear
+	 * PG_isolated but remains PG_movable so VM should catch
+	 * it and clear PG_movable for it.
+	 */
+	if (unlikely(PageMovable(page))) {
+		lru_movable = false;
+		VM_BUG_ON_PAGE(!mapping, page);
+		if (!PageIsolated(page)) {
+			rc = MIGRATEPAGE_SUCCESS;
+			__ClearPageMovable(page);
+			goto out;
+		}
+	}
+
+	if (likely(lru_movable)) {
+		if (!mapping)
+			rc = migrate_page(mapping, newpage, page, mode);
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * provide a migratepage callback. Anonymous pages
+			 * are part of swap space which also has its own
+			 * migratepage callback. This is the most common path
+			 * for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping, newpage,
+							page, mode);
+		else
+			rc = fallback_migrate_page(mapping, newpage,
+							page, mode);
+	} else {
+		rc = mapping->a_ops->migratepage(mapping, newpage,
+						page, mode);
+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+			PageIsolated(page));
+	}
 
 	/*
 	 * When successful, old pagecache page->mapping must be cleared before
@@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageAnon(page))
 			page->mapping = NULL;
 	}
+out:
 	return rc;
 }
 
@@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 			put_new_page(newpage, private);
 		else
 			put_page(newpage);
+		if (PageMovable(page))
+			__ClearPageMovable(page);
 		goto out;
 	}
 
@@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				num_poisoned_pages_inc();
 		}
 	} else {
-		if (rc != -EAGAIN)
-			putback_lru_page(page);
+		if (rc != -EAGAIN) {
+			/*
+			 * subsystem couldn't remove PG_movable since page is
+			 * isolated so PageMovable check is not racy in here.
+			 * But PageIsolated check can be racy but it's okay
+			 * because putback_movable_page checks it under PG_lock
+			 * again.
+			 */
+			if (unlikely(PageMovable(page))) {
+				if (PageIsolated(page))
+					putback_movable_page(page);
+				else {
+					__ClearPageMovable(page);
+					put_page(page);
+				}
+			} else {
+				putback_lru_page(page);
+			}
+		}
+
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 03/16] mm: add non-lru movable page support document
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Minchan Kim, Hugh Dickins,
	linux-kernel, Al Viro, virtualization, bfields, linux-mm,
	Gioh Kim, koct9i, Sangseok Lee, jlayton, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

This patch describes what a subsystem should do for non-lru movable
page supporting.

Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/vfs.txt | 11 ++++++-
 Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4c1b6c3b4bc8..d63142f8ed7b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -752,12 +752,21 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
+  isolate_page: Called by the VM when isolating a movable non-lru page.
+	If page is successfully isolated, we should mark the page as
+	PG_isolated via __SetPageIsolated.
+
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
 	and an old page to this function.  migrate_page should
 	transfer any private data across and update any references
-        that it has to the page.
+	that it has to the page. If migrated page is non-lru page,
+	we should clear PG_isolated and PG_movable via __ClearPageIsolated
+	and __ClearPageMovable.
+
+  putback_page: Called by the VM when isolated page's migration fails.
+	We should clear PG_isolated marked in isolated_page function.
 
   launder_page: Called before freeing a page - it writes back the dirty page. To
   	prevent redirtying the page, it is kept locked during the whole
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..c4e7551a414e 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
 20. The new page is moved to the LRU and can be scanned by the swapper
     etc again.
 
-Christoph Lameter, May 8, 2006.
+C. Non-LRU Page migration
+-------------------------
+
+Although original migration aimed for reducing the latency of memory access
+for NUMA, compaction who want to create high-order page is also main customer.
+
+Ppage migration's disadvantage is that it was designed to migrate only
+*LRU* pages. However, there are potential non-lru movable pages which can be
+migrated in system, for example, zsmalloc, virtio-balloon pages.
+For virtio-balloon pages, some parts of migration code path was hooked up
+and added virtio-balloon specific functions to intercept logi.
+It's too specific to one subsystem so other subsystem who want to make
+their pages movable should add own specific hooks in migration path.
+
+To solve such problem, VM supports non-LRU page migration which provides
+generic functions for non-LRU movable pages without needing subsystem
+specific hook in mm/{migrate|compact}.c.
+
+If a subsystem want to make own pages movable, it should mark pages as
+PG_movable via __SetPageMovable. __SetPageMovable needs address_space for
+argument for register functions which will be called by VM.
+
+Three functions in address_space_operation related to non-lru movable page:
+
+	bool (*isolate_page) (struct page *, isolate_mode_t);
+	int (*migratepage) (struct address_space *,
+		struct page *, struct page *, enum migrate_mode);
+	void (*putback_page)(struct page *);
+
+1. Isolation
+
+What VM expected on isolate_page of subsystem is to set PG_isolated flags
+of the page if it was successful. With that, concurrent isolation among
+CPUs skips the isolated page by other CPU earlier. VM calls isolate_page
+under PG_lock of page. If a subsystem cannot isolate the page, it should
+return false.
 
+2. Migration
+
+After successful isolation, VM calls migratepage. The migratepage's goal is
+to move content of the old page to new page and set up struct page fields
+of new page. If migration is successful, subsystem should release old page's
+refcount to free. Keep in mind that subsystem should clear PG_movable and
+PG_isolated before releasing the refcount.  If everything are done, user
+should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
+at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
+migration because VM interprets -EAGAIN as "temporal migration failure".
+
+3. Putback
+
+If migration was unsuccessful, VM calls putback_page. The subsystem should
+insert isolated page to own data structure again if it has. And subsystem
+should clear PG_isolated which was marked in isolation step.
+
+Note about releasing page:
+
+Subsystem can release pages whenever it want but if it releses the page
+which is already isolated, it should clear PG_isolated but doesn't touch
+PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
+his job done. Otherweise, subsystem should clear both page flags before
+releasing the page.
+
+Note about PG_isolated:
+
+PG_isolated check on a page is valid only if the page's flag is already
+set to PG_movable.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Minchan Kim, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, koct9i,
	Sangseok Lee, jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

Now, VM has a feature to migrate non-lru movable pages so
balloon doesn't need custom migration hooks in migrate.c
and compact.c. Instead, this patch implements page->mapping
->{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general
migration functions and make balloon compaction simple.

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Gioh Kim <gurugio@hanmail.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    |  53 ++++++++++++++++---
 include/linux/balloon_compaction.h |  49 ++++--------------
 include/linux/page-flags.h         |  56 +++++++++++---------
 include/uapi/linux/magic.h         |   1 +
 mm/balloon_compaction.c            | 101 ++++++++-----------------------------
 mm/compaction.c                    |   7 ---
 mm/migrate.c                       |  22 ++------
 mm/vmscan.c                        |   2 +-
 8 files changed, 119 insertions(+), 172 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..0c16192d2684 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -482,10 +487,29 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	mutex_unlock(&vb->balloon_lock);
 
+	ClearPageIsolated(page);
 	put_page(page); /* balloon reference */
 
 	return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,10 +539,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
-
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -527,13 +547,32 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
 
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -567,6 +606,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..4c693bf3abdf 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include <linux/migrate.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -123,8 +99,7 @@ static inline bool isolated_balloon_page(struct page *page)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
-	__SetPageBalloon(page);
-	SetPagePrivate(page);
+	__SetPageBalloon(page, balloon->inode->i_mapping);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -141,10 +116,8 @@ static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
@@ -166,7 +139,7 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
-	__SetPageBalloon(page);
+	__SetPageBalloon(page, NULL);
 	list_add(&page->lru, &balloon->pages);
 }
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 77ebf8fdbc6e..603c47752126 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -599,32 +599,13 @@ static inline void __ClearPageBuddy(struct page *page)
 
 extern bool is_free_buddy_page(struct page *page);
 
-#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)
-
-static inline int PageBalloon(struct page *page)
-{
-	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
-}
-
-static inline void __SetPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
-}
-
-static inline void __ClearPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageBalloon(page), page);
-	atomic_set(&page->_mapcount, -1);
-}
-
-#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
+#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
 
 static inline int PageMovable(struct page *page)
 {
-	return ((test_bit(PG_movable, &(page)->flags) &&
-		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
-		|| PageBalloon(page));
+	return (test_bit(PG_movable, &(page)->flags) &&
+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
 }
 
 /* Caller should hold a PG_lock */
@@ -645,6 +626,35 @@ static inline void __ClearPageMovable(struct page *page)
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static inline int PageBalloon(struct page *page)
+{
+	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
+		&& PagePrivate2(page);
+}
+
+static inline void __SetPageBalloon(struct page *page,
+				struct address_space *mapping)
+{
+	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__SetPageMovable(page, mapping);
+#else
+	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
+#endif
+	SetPagePrivate2(page);
+}
+
+static inline void __ClearPageBalloon(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageBalloon(page), page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__ClearPageMovable(page);
+#else
+	atomic_set(&page->_mapcount, -1);
+#endif
+	ClearPagePrivate2(page);
+}
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181ad73d5..e1fbe72c39c0 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -78,5 +78,6 @@
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..1fbc7fb387bb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,53 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+	SetPageIsolated(page);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+/* putback_lru_page() counterpart for a ballooned page */
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
+	ClearPageIsolated(page);
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
-
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
-
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
-
-	return rc;
+	return balloon->migratepage(balloon, newpage, page, mode);
 }
+
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index 7557aedddaee..e336c620fd7b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -708,13 +708,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			if (unlikely(PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
diff --git a/mm/migrate.c b/mm/migrate.c
index b56bf2b3fe8c..028814625eea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -157,8 +157,8 @@ void putback_movable_page(struct page *page)
  * from where they were once taken off for compaction/migration.
  *
  * This function shall be used whenever the isolated pageset has been
- * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
- * and isolate_huge_page().
+ * built from lru, movable, hugetlbfs page.
+ * See isolate_migratepages_range() and isolate_huge_page().
  */
 void putback_movable_pages(struct list_head *l)
 {
@@ -173,9 +173,7 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page))) {
-			balloon_page_putback(page);
-		} else if (unlikely(PageMovable(page))) {
+		if (unlikely(PageMovable(page))) {
 			if (PageIsolated(page)) {
 				putback_movable_page(page);
 			} else {
@@ -977,18 +975,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -1033,7 +1019,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 out:
 	/* If migration is successful, move newpage to right list */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d82196244340..c7696a2e11c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !PageIsolated(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 05/16] zsmalloc: keep max_object in size_class
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Minchan Kim, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

Every zspage in a size_class has same number of max objects so
we could move it to a size_class.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a0890e9003e2..8649d0243e6c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -32,8 +32,6 @@
  *	page->freelist: points to the first free object in zspage.
  *		Free objects are linked together using in-place
  *		metadata.
- *	page->objects: maximum number of objects we can store in this
- *		zspage (class->zspage_order * PAGE_SIZE / class->size)
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
  *	page->mapping: class index and fullness group of the zspage
@@ -211,6 +209,7 @@ struct size_class {
 	 * of ZS_ALIGN.
 	 */
 	int size;
+	int objs_per_zspage;
 	unsigned int index;
 
 	struct zs_size_stat stats;
@@ -627,21 +626,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
  * the pool (not yet implemented). This function returns fullness
  * status of the given page.
  */
-static enum fullness_group get_fullness_group(struct page *first_page)
+static enum fullness_group get_fullness_group(struct size_class *class,
+						struct page *first_page)
 {
-	int inuse, max_objects;
+	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
 	inuse = first_page->inuse;
-	max_objects = first_page->objects;
+	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
 		fg = ZS_EMPTY;
-	else if (inuse == max_objects)
+	else if (inuse == objs_per_zspage)
 		fg = ZS_FULL;
-	else if (inuse <= 3 * max_objects / fullness_threshold_frac)
+	else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
 		fg = ZS_ALMOST_EMPTY;
 	else
 		fg = ZS_ALMOST_FULL;
@@ -728,7 +728,7 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
 	enum fullness_group currfg, newfg;
 
 	get_zspage_mapping(first_page, &class_idx, &currfg);
-	newfg = get_fullness_group(first_page);
+	newfg = get_fullness_group(class, first_page);
 	if (newfg == currfg)
 		goto out;
 
@@ -1008,9 +1008,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	init_zspage(class, first_page);
 
 	first_page->freelist = location_to_obj(first_page, 0);
-	/* Maximum number of objects we can store in this zspage */
-	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
 	error = 0; /* Success */
 
 cleanup:
@@ -1238,11 +1235,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 	return true;
 }
 
-static bool zspage_full(struct page *first_page)
+static bool zspage_full(struct size_class *class, struct page *first_page)
 {
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	return first_page->inuse == first_page->objects;
+	return first_page->inuse == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1628,7 +1625,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
 		}
 
 		/* Stop if there is no more space */
-		if (zspage_full(d_page)) {
+		if (zspage_full(class, d_page)) {
 			unpin_tag(handle);
 			ret = -ENOMEM;
 			break;
@@ -1687,7 +1684,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 {
 	enum fullness_group fullness;
 
-	fullness = get_fullness_group(first_page);
+	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
 	set_zspage_mapping(first_page, class->index, fullness);
 
@@ -1936,8 +1933,9 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 		class->size = size;
 		class->index = i;
 		class->pages_per_zspage = pages_per_zspage;
-		if (pages_per_zspage == 1 &&
-			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
+		class->objs_per_zspage = class->pages_per_zspage *
+						PAGE_SIZE / class->size;
+		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
 			class->huge = true;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Minchan Kim, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>

Currently, we store class:fullness into page->mapping.
The number of class we can support is 255 and fullness is 4 so
(8 + 2 = 10bit) is enough to represent them.
Meanwhile, the bits we need to store in-use objects in zspage
is that 11bit is enough.

For example, If we assume that 64K PAGE_SIZE, class_size 32
which is worst case, class->pages_per_zspage become 1 so
the number of objects in zspage is 2048 so 11bit is enough.
The next class is 32 + 256(i.e., ZS_SIZE_CLASS_DELTA).
With worst case that ZS_MAX_PAGES_PER_ZSPAGE, 64K * 4 /
(32 + 256) = 910 so 11bit is still enough.

So, we could squeeze inuse object count to page->mapping.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 103 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 71 insertions(+), 32 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 8649d0243e6c..4dd72a803568 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -34,8 +34,7 @@
  *		metadata.
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
- *	page->mapping: class index and fullness group of the zspage
- *	page->inuse: the number of objects that are used in this zspage
+ *	page->mapping: override by struct zs_meta
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
@@ -132,6 +131,13 @@
 /* each chunk includes extra space to keep handle */
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
 
+#define CLASS_BITS	8
+#define CLASS_MASK	((1 << CLASS_BITS) - 1)
+#define FULLNESS_BITS	2
+#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+#define INUSE_BITS	11
+#define INUSE_MASK	((1 << INUSE_BITS) - 1)
+
 /*
  * On systems with 4K page size, this gives 255 size classes! There is a
  * trader-off here:
@@ -145,7 +151,7 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> 8)
+#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
 
 /*
  * We do not maintain any list for completely empty or full pages
@@ -155,7 +161,7 @@ enum fullness_group {
 	ZS_ALMOST_EMPTY,
 	_ZS_NR_FULLNESS_GROUPS,
 
-	ZS_EMPTY,
+	ZS_EMPTY = _ZS_NR_FULLNESS_GROUPS,
 	ZS_FULL
 };
 
@@ -263,14 +269,11 @@ struct zs_pool {
 #endif
 };
 
-/*
- * A zspage's class index and fullness group
- * are encoded in its (first)page->mapping
- */
-#define CLASS_IDX_BITS	28
-#define FULLNESS_BITS	4
-#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+struct zs_meta {
+	unsigned long class:CLASS_BITS;
+	unsigned long fullness:FULLNESS_BITS;
+	unsigned long inuse:INUSE_BITS;
+};
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -412,28 +415,61 @@ static int is_last_page(struct page *page)
 	return PagePrivate2(page);
 }
 
+static int get_zspage_inuse(struct page *first_page)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+
+	return m->inuse;
+}
+
+static void set_zspage_inuse(struct page *first_page, int val)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	m->inuse = val;
+}
+
+static void mod_zspage_inuse(struct page *first_page, int val)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	m->inuse += val;
+}
+
 static void get_zspage_mapping(struct page *first_page,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
 {
-	unsigned long m;
+	struct zs_meta *m;
+
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (unsigned long)first_page->mapping;
-	*fullness = m & FULLNESS_MASK;
-	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+	m = (struct zs_meta *)&first_page->mapping;
+	*fullness = m->fullness;
+	*class_idx = m->class;
 }
 
 static void set_zspage_mapping(struct page *first_page,
 				unsigned int class_idx,
 				enum fullness_group fullness)
 {
-	unsigned long m;
+	struct zs_meta *m;
+
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-			(fullness & FULLNESS_MASK);
-	first_page->mapping = (struct address_space *)m;
+	m = (struct zs_meta *)&first_page->mapping;
+	m->fullness = fullness;
+	m->class = class_idx;
 }
 
 /*
@@ -632,9 +668,7 @@ static enum fullness_group get_fullness_group(struct size_class *class,
 	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	inuse = first_page->inuse;
+	inuse = get_zspage_inuse(first_page);
 	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
@@ -677,10 +711,10 @@ static void insert_zspage(struct size_class *class,
 
 	/*
 	 * We want to see more ZS_FULL pages and less almost
-	 * empty/full. Put pages with higher ->inuse first.
+	 * empty/full. Put pages with higher inuse first.
 	 */
 	list_add_tail(&first_page->lru, &(*head)->lru);
-	if (first_page->inuse >= (*head)->inuse)
+	if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
 		*head = first_page;
 }
 
@@ -896,7 +930,7 @@ static void free_zspage(struct page *first_page)
 	struct page *nextp, *tmp, *head_extra;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-	VM_BUG_ON_PAGE(first_page->inuse, first_page);
+	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
 	head_extra = (struct page *)page_private(first_page);
 
@@ -992,7 +1026,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 			SetPagePrivate(page);
 			set_page_private(page, 0);
 			first_page = page;
-			first_page->inuse = 0;
+			set_zspage_inuse(page, 0);
 		}
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
@@ -1237,9 +1271,7 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 
 static bool zspage_full(struct size_class *class, struct page *first_page)
 {
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	return first_page->inuse == class->objs_per_zspage;
+	return get_zspage_inuse(first_page) == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1372,7 +1404,7 @@ static unsigned long obj_malloc(struct size_class *class,
 		/* record handle in first_page->private */
 		set_page_private(first_page, handle);
 	kunmap_atomic(vaddr);
-	first_page->inuse++;
+	mod_zspage_inuse(first_page, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
 	return obj;
@@ -1457,7 +1489,7 @@ static void obj_free(struct size_class *class, unsigned long obj)
 		set_page_private(first_page, 0);
 	kunmap_atomic(vaddr);
 	first_page->freelist = (void *)obj;
-	first_page->inuse--;
+	mod_zspage_inuse(first_page, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
 
@@ -2002,6 +2034,13 @@ static int __init zs_init(void)
 	if (ret)
 		goto notifier_fail;
 
+	/*
+	 * A zspage's class index, fullness group, inuse object count are
+	 * encoded in its (first)page->mapping so sizeof(struct zs_meta)
+	 * should be less than sizeof(page->mapping(i.e., unsigned long)).
+	 */
+	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
+
 	init_zs_size_classes();
 
 #ifdef CONFIG_ZPOOL
-- 
1.9.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox