Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH v4 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-04-27  7:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
	Jonathan Corbet, Hugh Dickins, linux-kernel, dri-devel,
	virtualization, John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman,
	Joonsoo Kim, Vlastimil Babka
In-Reply-To: <1461743305-19970-1-git-send-email-minchan@kernel.org>

We have allowed migration for only LRU pages until now and it was
enough to make high-order pages. But recently, embedded system(e.g.,
webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
so we have seen several reports about troubles of small high-order
allocation. For fixing the problem, there were several efforts
(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
reserved memory, vmalloc and so on) but if there are lots of
non-movable pages in system, their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages
with movable. For the feature, this patch introduces functions related
to migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three functions
which are function pointers of struct address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the page
as PG_isolated so concurrent isolation in several CPUs skip the page
for isolation. If a driver cannot isolate the page, it should return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page.
The function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
migrated the oldpage successfully and returns MIGRATEPAGE_SUCCESS.
If driver cannot migrate the page at the moment, driver can return -EAGAIN.
On -EAGAIN, VM will retry page migration in a short time because VM interprets
-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
VM will give up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed page.
In this function, driver should put the isolated page back to the own data
structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family functions
which will be called by VM. Exactly speaking, PG_movable is not a real flag of
struct page. Rather than, VM reuses page->mapping's lower bits to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly. Instead, driver should
use page_mapping which mask off the low two bits of page->mapping so it can get
right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.
As well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE
(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
page is LRU or non-lru movable once the page has been isolated. Because
LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more expensive
checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via __ClearMovablePage
under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated page
as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
movable page, it can skip it. Driver doesn't need to manipulate the flag
because VM will set/clear it automatically. Keep in mind that if driver
sees PG_isolated page, it means the page have been isolated by VM so it
shouldn't touch page.lru field.
PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
for own purpose.

Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking |   4 +
 Documentation/filesystems/vfs.txt |  11 ++
 Documentation/vm/page_migration   | 107 +++++++++++++++++-
 include/linux/fs.h                |   2 +
 include/linux/ksm.h               |   3 +-
 include/linux/migrate.h           |   5 +
 include/linux/mm.h                |   1 +
 include/linux/page-flags.h        |  23 +++-
 mm/compaction.c                   |  47 +++++---
 mm/ksm.c                          |   4 +-
 mm/migrate.c                      | 222 ++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                   |   4 +-
 mm/util.c                         |   6 +-
 13 files changed, 390 insertions(+), 49 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..0bb79560abb3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
+	void (*putback_page) (struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
+isolate_page:		yes
 migratepage:		yes (both)
+putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4164bd6397a2..4fa8f54d94c8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -592,9 +592,14 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	/* isolate a page for migration */
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
+	/* put migration-failed page back to right list */
+	void (*putback_page) (struct page *);
 	int (*launder_page) (struct page *);
+
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
@@ -747,6 +752,10 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
+  isolate_page: Called by the VM when isolating a movable non-lru page.
+	If page is successfully isolated, VM marks the page as PG_isolated
+	via __SetPageIsolated.
+
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
@@ -754,6 +763,8 @@ struct address_space_operations {
 	transfer any private data across and update any references
         that it has to the page.
 
+  putback_page: Called by the VM when isolated page's migration fails.
+
   launder_page: Called before freeing a page - it writes back the dirty page. To
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..b89d1de026df 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,110 @@ is increased so that the page cannot be freed while page migration occurs.
 20. The new page is moved to the LRU and can be scanned by the swapper
     etc again.
 
-Christoph Lameter, May 8, 2006.
+C. Non-LRU page migration
+-------------------------
+
+Although original migration aimed for reducing the latency of memory access
+for NUMA, compaction who want to create high-order page is also main customer.
+
+Current problem of the implementation is that it is designed to migrate only
+*LRU* pages. However, there are potential non-lru pages which can be migrated
+in drivers, for example, zsmalloc, virtio-balloon pages.
+
+For virtio-balloon pages, some parts of migration code path have been hooked
+up and added virtio-balloon specific functions to intercept migration logics.
+It's too specific to a driver so other drivers who want to make their pages
+movable would have to add own specific hooks in migration path.
+
+To overclome the problem, VM supports non-LRU page migration which provides
+generic functions for non-LRU movable pages without driver specific hooks
+migration path.
+
+If a driver want to make own pages movable, it should define three functions
+which are function pointers of struct address_space_operations.
+
+1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
+
+What VM expects on isolate_page function of driver is to return *true*
+if driver isolates page successfully. On returing true, VM marks the page
+as PG_isolated so concurrent isolation in several CPUs skip the page
+for isolation. If a driver cannot isolate the page, it should return *false*.
+
+Once page is successfully isolated, VM uses page.lru fields so driver
+shouldn't expect to preserve values in that fields.
+
+2. int (*migratepage) (struct address_space *mapping,
+		struct page *newpage, struct page *oldpage, enum migrate_mode);
+
+After isolation, VM calls migratepage of driver with isolated page.
+The function of migratepage is to move content of the old page to new page
+and set up fields of struct page newpage. Keep in mind that you should
+clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
+migrated the oldpage successfully and returns MIGRATEPAGE_SUCCESS.
+If driver cannot migrate the page at the moment, driver can return -EAGAIN.
+On -EAGAIN, VM will retry page migration in a short time because VM interprets
+-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
+VM will give up the page migration without retrying in this time.
+
+Driver shouldn't touch page.lru field VM using in the functions.
+
+3. void (*putback_page)(struct page *);
+
+If migration fails on isolated page, VM should return the isolated page
+to the driver so VM calls driver's putback_page with migration failed page.
+In this function, driver should put the isolated page back to the own data
+structure.
 
+4. non-lru movable page flags
+
+There are two page flags for supporting non-lru movable page.
+
+* PG_movable
+
+Driver should use the below function to make page movable under page_lock.
+
+	void __SetPageMovable(struct page *page, struct address_space *mapping)
+
+It needs argument of address_space for registering migration family functions
+which will be called by VM. Exactly speaking, PG_movable is not a real flag of
+struct page. Rather than, VM reuses page->mapping's lower bits to represent it.
+
+	#define PAGE_MAPPING_MOVABLE 0x2
+	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
+
+so driver shouldn't access page->mapping directly. Instead, driver should
+use page_mapping which mask off the low two bits of page->mapping so it can get
+right struct address_space.
+
+For testing of non-lru movable page, VM supports __PageMovable function.
+However, it doesn't guarantee to identify non-lru movable page because
+page->mapping field is unified with other variables in struct page.
+As well, if driver releases the page after isolation by VM, page->mapping
+doesn't have stable value although it has PAGE_MAPPING_MOVABLE
+(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
+page is LRU or non-lru movable once the page has been isolated. Because
+LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
+good for just peeking to test non-lru movable pages before more expensive
+checking with lock_page in pfn scanning to select victim.
+
+For guaranteeing non-lru movable page, VM provides PageMovable function.
+Unlike __PageMovable, PageMovable functions validates page->mapping and
+mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
+destroying of page->mapping.
+
+Driver using __SetPageMovable should clear the flag via __ClearMovablePage
+under page_lock before the releasing the page.
+
+* PG_isolated
+
+To prevent concurrent isolation among several CPUs, VM marks isolated page
+as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
+movable page, it can skip it. Driver doesn't need to manipulate the flag
+because VM will set/clear it automatically. Keep in mind that if driver
+sees PG_isolated page, it means the page have been isolated by VM so it
+shouldn't touch page.lru field.
+PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
+for own purpose.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c0b7e54d47..e520d365adf6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -401,6 +401,8 @@ struct address_space_operations {
 	 */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
+	bool (*isolate_page)(struct page *, isolate_mode_t);
+	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 7ae216a39c9e..481c8c4627ca 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -43,8 +43,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
-	page->mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f087653baf94..9365858db978 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -34,11 +34,16 @@ extern char *migrate_reason_names[MR_TYPES];
 
 #ifdef CONFIG_MIGRATION
 
+extern int PageMovable(struct page *page);
+extern void __SetPageMovable(struct page *page, struct address_space *mapping);
+extern void __ClearPageMovable(struct page *page);
 extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc23193f556..d1095c8ea851 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1021,6 +1021,7 @@ static inline pgoff_t page_file_index(struct page *page)
 }
 
 bool page_mapped(struct page *page);
+struct address_space *page_mapping(struct page *page);
 
 /*
  * Return true only if the page has been allocated with
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 36e3a3fa41b2..079a90ceae75 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -132,6 +132,9 @@ enum pageflags {
 
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_private_2,
+
+	/* non-lru isolated movable page */
+	PG_isolated = PG_reclaim,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -367,15 +370,17 @@ PAGEFLAG(Idle, idle, PF_ANY)
  * and then page->mapping points, not to an anon_vma, but to a private
  * structure which KSM associates with that merged page.  See ksm.h.
  *
- * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable
+ * page and then page->mapping points a struct address_space.
  *
  * Please note that, confusingly, "page_mapping" refers to the inode
  * address_space which maps the page from disk; whereas "page_mapped"
  * refers to user virtual address space into which the page is mapped.
  */
-#define PAGE_MAPPING_ANON	1
-#define PAGE_MAPPING_KSM	2
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
+#define PAGE_MAPPING_ANON	0x1
+#define PAGE_MAPPING_MOVABLE	0x2
+#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
 static __always_inline int PageAnon(struct page *page)
 {
@@ -383,6 +388,12 @@ static __always_inline int PageAnon(struct page *page)
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
+static __always_inline int __PageMovable(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
+				PAGE_MAPPING_MOVABLE;
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
@@ -394,7 +405,7 @@ static __always_inline int PageKsm(struct page *page)
 {
 	page = compound_head(page);
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+				PAGE_MAPPING_KSM;
 }
 #else
 TESTPAGEFLAG_FALSE(Ksm)
@@ -624,6 +635,8 @@ static inline void __ClearPageBalloon(struct page *page)
 	atomic_set(&page->_mapcount, -1);
 }
 
+__PAGEFLAG(Isolated, isolated, PF_ANY);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/compaction.c b/mm/compaction.c
index 2f31c3769a8e..01789ebc4b10 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -740,21 +740,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		/*
-		 * Check may be lockless but that's ok as we recheck later.
-		 * It's possible to migrate LRU pages and balloon pages
-		 * Skip any other type of page
-		 */
-		is_lru = PageLRU(page);
-		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-		}
-
-		/*
 		 * Regardless of being on LRU, compound pages such as THP and
 		 * hugetlbfs are not to be compacted. We can potentially save
 		 * a lot of iterations if we skip them at once. The check is
@@ -770,8 +755,38 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 		}
 
-		if (!is_lru)
+		/*
+		 * Check may be lockless but that's ok as we recheck later.
+		 * It's possible to migrate LRU and non-lru movable pages.
+		 * Skip any other type of page
+		 */
+		is_lru = PageLRU(page);
+		if (!is_lru) {
+			if (unlikely(balloon_page_movable(page))) {
+				if (balloon_page_isolate(page)) {
+					/* Successfully isolated */
+					goto isolate_success;
+				}
+			}
+
+			/*
+			 * __PageMovable can return false positive so we need
+			 * to verify it under page_lock.
+			 */
+			if (unlikely(__PageMovable(page)) &&
+					!PageIsolated(page)) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+
+				if (isolate_movable_page(page, isolate_mode))
+					goto isolate_success;
+			}
+
 			goto isolate_fail;
+		}
 
 		/*
 		 * Migration will fail if an anonymous page is pinned in memory,
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dee82e3f59a..647c3a446d35 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -658,8 +658,8 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 	void *expected_mapping;
 	unsigned long kpfn;
 
-	expected_mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	expected_mapping = (void *)((unsigned long)stable_node |
+					PAGE_MAPPING_KSM);
 again:
 	kpfn = READ_ONCE(stable_node->kpfn);
 	page = pfn_to_page(kpfn);
diff --git a/mm/migrate.c b/mm/migrate.c
index 7880f30d1d3d..b2ababa21440 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,38 @@
 
 #include "internal.h"
 
+int PageMovable(struct page *page)
+{
+	struct address_space *mapping;
+
+	WARN_ON(!PageLocked(page));
+	if (!__PageMovable(page))
+		goto out;
+
+	mapping = page_mapping(page);
+	if (mapping && mapping->a_ops && mapping->a_ops->isolate_page)
+		return 1;
+out:
+	return 0;
+}
+
+void __SetPageMovable(struct page *page, struct address_space *mapping)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
+	page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+}
+
+void __ClearPageMovable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!((unsigned long)page->mapping & PAGE_MAPPING_MOVABLE),
+				page);
+	page->mapping = (void *)((unsigned long)page->mapping &
+				PAGE_MAPPING_MOVABLE);
+}
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -74,6 +106,79 @@ int migrate_prep_local(void)
 	return 0;
 }
 
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Avoid burning cycles with pages that are yet under __free_pages(),
+	 * or just got freed under us.
+	 *
+	 * In case we 'win' a race for a movable page being freed under us and
+	 * raise its refcount preventing __free_pages() from doing its job
+	 * the put_page() at the end of this block will take care of
+	 * release this page, thus avoiding a nasty leakage.
+	 */
+	if (unlikely(!get_page_unless_zero(page)))
+		goto out;
+
+	/*
+	 * Check PageMovable before holding a PG_lock because page's owner
+	 * assumes anybody doesn't touch PG_lock of newly allocated page
+	 * so unconditionally grapping the lock ruins page's owner side.
+	 */
+	if (unlikely(!__PageMovable(page)))
+		goto out_putpage;
+	/*
+	 * As movable pages are not isolated from LRU lists, concurrent
+	 * compaction threads can race against page migration functions
+	 * as well as race against the releasing a page.
+	 *
+	 * In order to avoid having an already isolated movable page
+	 * being (wrongly) re-isolated while it is under migration,
+	 * or to avoid attempting to isolate pages being released,
+	 * lets be sure we have the page lock
+	 * before proceeding with the movable page isolation steps.
+	 */
+	if (unlikely(!trylock_page(page)))
+		goto out_putpage;
+
+	if (!PageMovable(page) || PageIsolated(page))
+		goto out_no_isolated;
+
+	mapping = page_mapping(page);
+	if (!mapping->a_ops->isolate_page(page, mode))
+		goto out_no_isolated;
+
+	/* Driver shouldn't use PG_isolated bit of page->flags */
+	WARN_ON_ONCE(PageIsolated(page));
+	__SetPageIsolated(page);
+	unlock_page(page);
+
+	return true;
+
+out_no_isolated:
+	unlock_page(page);
+out_putpage:
+	put_page(page);
+out:
+	return false;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	mapping = page_mapping(page);
+	mapping->a_ops->putback_page(page);
+	__ClearPageIsolated(page);
+}
+
 /*
  * Put previously isolated pages back onto the appropriate lists
  * from where they were once taken off for compaction/migration.
@@ -95,10 +200,25 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
+		if (unlikely(isolated_balloon_page(page))) {
 			balloon_page_putback(page);
-		else
+		/*
+		 * We isolated non-lru movable page so here we can use
+		 * __PageMovable because LRU page's mapping cannot have
+		 * PAGE_MAPPING_MOVABLE.
+		 */
+		} else if (unlikely(__PageMovable(page))) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		} else {
 			putback_lru_page(page);
+		}
 	}
 }
 
@@ -601,7 +721,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
  ***********************************************************/
 
 /*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
  * pages that do not use PagePrivate/PagePrivate2.
  *
  * Pages are locked upon entry and exit.
@@ -764,33 +884,69 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping;
-	int rc;
+	int rc = -EAGAIN;
+	bool is_lru = !__PageMovable(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	mapping = page_mapping(page);
-	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, mode);
-	else if (mapping->a_ops->migratepage)
-		/*
-		 * Most pages have a mapping and most filesystems provide a
-		 * migratepage callback. Anonymous pages are part of swap
-		 * space which also has its own migratepage callback. This
-		 * is the most common path for page migration.
-		 */
-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page, mode);
+	/*
+	 * In case of non-lru page, it could be released after
+	 * isolation step. In that case, we shouldn't try
+	 * fallback migration which is designed for LRU pages.
+	 */
+	if (unlikely(!is_lru)) {
+		VM_BUG_ON_PAGE(!PageIsolated(page), page);
+		if (!PageMovable(page)) {
+			rc = MIGRATEPAGE_SUCCESS;
+			__ClearPageIsolated(page);
+			goto out;
+		}
+	}
+
+	if (likely(is_lru)) {
+		if (!mapping)
+			rc = migrate_page(mapping, newpage, page, mode);
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * provide a migratepage callback. Anonymous pages
+			 * are part of swap space which also has its own
+			 * migratepage callback. This is the most common path
+			 * for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping, newpage,
+							page, mode);
+		else
+			rc = fallback_migrate_page(mapping, newpage,
+							page, mode);
+	} else {
+		rc = mapping->a_ops->migratepage(mapping, newpage,
+						page, mode);
+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+			!PageIsolated(page));
+	}
 
 	/*
 	 * When successful, old pagecache page->mapping must be cleared before
 	 * page is freed; but stats require that PageAnon be left as PageAnon.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (!PageAnon(page))
+		if (__PageMovable(page)) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+			/*
+			 * We clear PG_movable under page_lock so any compactor
+			 * cannot try to migrate this page.
+			 */
+			__ClearPageIsolated(page);
+		}
+
+		if (!((unsigned long)page->mapping & PAGE_MAPPING_FLAGS))
 			page->mapping = NULL;
 	}
+out:
 	return rc;
 }
 
@@ -800,6 +956,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
+	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
 		if (!force || mode == MIGRATE_ASYNC)
@@ -891,6 +1048,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		goto out_unlock_both;
 	}
 
+	if (unlikely(!is_lru)) {
+		rc = move_to_new_page(newpage, page, mode);
+		goto out_unlock_both;
+	}
+
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -940,7 +1102,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(__is_movable_balloon_page(newpage) ||
+				__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
@@ -986,6 +1149,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
+		if (unlikely(__PageMovable(page))) {
+			lock_page(page);
+			if (!PageMovable(page))
+				__ClearPageIsolated(page);
+			unlock_page(page);
+		}
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
@@ -1035,8 +1204,21 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				num_poisoned_pages_inc();
 		}
 	} else {
-		if (rc != -EAGAIN)
-			putback_lru_page(page);
+		if (rc != -EAGAIN) {
+			if (likely(!__PageMovable(page))) {
+				putback_lru_page(page);
+				goto put_new;
+			}
+
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		}
+put_new:
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 226db261215a..4af78fa1e665 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,8 +1034,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	kmemcheck_free_shadow(page, order);
 	kasan_poison_free_pages(page, order);
 
-	if (PageAnon(page))
+	/* If PageAnon or __PageMovable is true, clear page->mapping */
+	if ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS)
 		page->mapping = NULL;
+
 	bad += free_pages_check(page);
 	for (i = 1; i < (1 << order); i++) {
 		if (compound)
diff --git a/mm/util.c b/mm/util.c
index 23025c8ad31f..732cf3c74646 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -398,10 +398,12 @@ struct address_space *page_mapping(struct page *page)
 	}
 
 	mapping = page->mapping;
-	if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
+	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
-	return mapping;
+
+	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
+EXPORT_SYMBOL(page_mapping);
 
 /* Slow path of page_mapcount() for compound pages */
 int __page_mapcount(struct page *page)
-- 
1.9.1

^ permalink raw reply related

* [PATCH v4 03/12] mm: balloon: use general non-lru movable page feature
From: Minchan Kim @ 2016-04-27  7:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rafael Aquini, Minchan Kim, linux-kernel, virtualization,
	linux-mm, Gioh Kim, Konstantin Khlebnikov
In-Reply-To: <1461743305-19970-1-git-send-email-minchan@kernel.org>

Now, VM has a feature to migrate non-lru movable pages so
balloon doesn't need custom migration hooks in migrate.c
and compaction.c. Instead, this patch implements
page->mapping->a_ops->{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general
migration functions and make balloon compaction simple.

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    | 52 +++++++++++++++++++--
 include/linux/balloon_compaction.h | 51 ++++++---------------
 include/uapi/linux/magic.h         |  1 +
 mm/balloon_compaction.c            | 94 +++++++-------------------------------
 mm/compaction.c                    |  7 ---
 mm/migrate.c                       | 19 +-------
 mm/vmscan.c                        |  2 +-
 7 files changed, 83 insertions(+), 143 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..04fc63b4a735 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -486,6 +491,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,9 +538,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
 
 	err = init_vqs(vb);
 	if (err)
@@ -527,13 +547,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
+
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -567,6 +607,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..79542b2698ec 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include <linux/migrate.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -124,7 +100,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
 	__SetPageBalloon(page);
-	SetPagePrivate(page);
+	__SetPageMovable(page, balloon->inode->i_mapping);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -140,11 +116,14 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
+	__ClearPageMovable(page);
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	/*
+	 * No touch page.lru field once @page has been isolated
+	 * because VM is using the field.
+	 */
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181ad73d5..e1fbe72c39c0 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -78,5 +78,6 @@
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..da91df50ba31 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,50 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
+
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
 
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
+	return balloon->migratepage(balloon, newpage, page, mode);
+}
 
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 
-	return rc;
-}
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index 01789ebc4b10..adb805b01119 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -762,13 +762,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			/*
 			 * __PageMovable can return false positive so we need
 			 * to verify it under page_lock.
diff --git a/mm/migrate.c b/mm/migrate.c
index b2ababa21440..df258cf38070 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -200,14 +200,12 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page))) {
-			balloon_page_putback(page);
 		/*
 		 * We isolated non-lru movable page so here we can use
 		 * __PageMovable because LRU page's mapping cannot have
 		 * PAGE_MAPPING_MOVABLE.
 		 */
-		} else if (unlikely(__PageMovable(page))) {
+		if (unlikely(__PageMovable(page))) {
 			VM_BUG_ON_PAGE(!PageIsolated(page), page);
 			lock_page(page);
 			if (PageMovable(page))
@@ -1036,18 +1034,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	if (unlikely(!is_lru)) {
 		rc = move_to_new_page(newpage, page, mode);
 		goto out_unlock_both;
@@ -1102,8 +1088,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage) ||
-				__PageMovable(newpage)))
+		if (unlikely(__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa36dd09eaa5..693ae62d236e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1257,7 +1257,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !__PageMovable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1

^ permalink raw reply related

* [PATCH] virtio: silence compiler warning
From: Jiri Benc @ 2016-04-27  9:33 UTC (permalink / raw)
  To: virtualization; +Cc: linux-kernel, Michael S. Tsirkin

gcc 4.8.5 emits the following false positive:

drivers/virtio/virtio_ring.c: In function 'vring_create_virtqueue':
drivers/virtio/virtio_ring.c:1032:5: warning: 'queue' may be used uninitialized in this function [-Wmaybe-uninitialized]

Silence it.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 drivers/virtio/virtio_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5c802d47892c..ca6bfddaacad 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1006,7 +1006,7 @@ struct virtqueue *vring_create_virtqueue(
 	const char *name)
 {
 	struct virtqueue *vq;
-	void *queue;
+	void *queue = NULL;
 	dma_addr_t dma_addr;
 	size_t queue_size_in_bytes;
 	struct vring vring;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] vhost_net: stop polling socket during rx processing
From: Michael S. Tsirkin @ 2016-04-27 11:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1461656153-24074-1-git-send-email-jasowang@redhat.com>

On Tue, Apr 26, 2016 at 03:35:53AM -0400, Jason Wang wrote:
> We don't stop polling socket during rx processing, this will lead
> unnecessary wakeups from under layer net devices (E.g
> sock_def_readable() form tun). Rx will be slowed down in this
> way. This patch avoids this by stop polling socket during rx
> processing. A small drawback is that this introduces some overheads in
> light load case because of the extra start/stop polling, but single
> netperf TCP_RR does not notice any change. In a super heavy load case,
> e.g using pktgen to inject packet to guest, we get about ~17%
> improvement on pps:
> 
> before: ~1370000 pkt/s
> after:  ~1500000 pkt/s
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

There is one other possible enhancement: we actually have the wait queue
lock taken in _wake_up, but we give it up only to take it again in the
handler.

It would be nicer to just remove the entry when we wake
the vhost thread. Re-add it if required.
I think that something like the below would give you the necessary API.
Pls feel free to use it if you are going to implement a patch on top
doing this - that's not a reason not to include this simple patch
though.

--->

wait: add API to drop a wait_queue_t entry from wake up handler

A wake up handler might want to remove its own wait queue entry to avoid
future wakeups.  In particular, vhost has such a need.  As wait queue
lock is already taken, all we need is an API to remove the entry without
wait_queue_head_t which isn't currently accessible to wake up handlers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27d7a0a..9c6604b 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -191,11 +191,17 @@ __add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 }
 
 static inline void
-__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
+__remove_wait_queue_entry(wait_queue_t *old)
 {
 	list_del(&old->task_list);
 }
 
+static inline void
+__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
+{
+	__remove_wait_queue_entry(old);
+}
+
 typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);

^ permalink raw reply related

* Re: [RFC PATCH V2 1/2] vhost: convert pre sorted vhost memory array to interval tree
From: Michael S. Tsirkin @ 2016-04-27 11:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
	pbonzini
In-Reply-To: <1458873274-13961-2-git-send-email-jasowang@redhat.com>

On Fri, Mar 25, 2016 at 10:34:33AM +0800, Jason Wang wrote:
> Current pre-sorted memory region array has some limitations for future
> device IOTLB conversion:
> 
> 1) need extra work for adding and removing a single region, and it's
>    expected to be slow because of sorting or memory re-allocation.
> 2) need extra work of removing a large range which may intersect
>    several regions with different size.
> 3) need trick for a replacement policy like LRU
> 
> To overcome the above shortcomings, this patch convert it to interval
> tree which can easily address the above issue with almost no extra
> work.
> 
> The patch could be used for:
> 
> - Extend the current API and only let the userspace to send diffs of
>   memory table.
> - Simplify Device IOTLB implementation.

Does this affect performance at all?

> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c   |   8 +--
>  drivers/vhost/vhost.c | 182 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.h |  27 ++++++--
>  3 files changed, 128 insertions(+), 89 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..481db96 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -968,20 +968,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
>  	struct socket *tx_sock = NULL;
>  	struct socket *rx_sock = NULL;
>  	long err;
> -	struct vhost_memory *memory;
> +	struct vhost_umem *umem;
>  
>  	mutex_lock(&n->dev.mutex);
>  	err = vhost_dev_check_owner(&n->dev);
>  	if (err)
>  		goto done;
> -	memory = vhost_dev_reset_owner_prepare();
> -	if (!memory) {
> +	umem = vhost_dev_reset_owner_prepare();
> +	if (!umem) {
>  		err = -ENOMEM;
>  		goto done;
>  	}
>  	vhost_net_stop(n, &tx_sock, &rx_sock);
>  	vhost_net_flush(n);
> -	vhost_dev_reset_owner(&n->dev, memory);
> +	vhost_dev_reset_owner(&n->dev, umem);
>  	vhost_net_vq_reset(n);
>  done:
>  	mutex_unlock(&n->dev.mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index ad2146a..32c35a9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -27,6 +27,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/module.h>
>  #include <linux/sort.h>
> +#include <linux/interval_tree_generic.h>
>  
>  #include "vhost.h"
>  
> @@ -42,6 +43,10 @@ enum {
>  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
>  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
>  
> +INTERVAL_TREE_DEFINE(struct vhost_umem_node,
> +		     rb, __u64, __subtree_last,
> +		     START, LAST, , vhost_umem_interval_tree);
> +
>  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
>  static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
>  {
> @@ -275,7 +280,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> -	vq->memory = NULL;
> +	vq->umem = NULL;
>  	vq->is_le = virtio_legacy_is_little_endian();
>  	vhost_vq_reset_user_be(vq);
>  }
> @@ -381,7 +386,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	mutex_init(&dev->mutex);
>  	dev->log_ctx = NULL;
>  	dev->log_file = NULL;
> -	dev->memory = NULL;
> +	dev->umem = NULL;
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> @@ -486,27 +491,36 @@ err_mm:
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>  
> -struct vhost_memory *vhost_dev_reset_owner_prepare(void)
> +static void *vhost_kvzalloc(unsigned long size)
> +{
> +	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> +
> +	if (!n)
> +		n = vzalloc(size);
> +	return n;
> +}
> +
> +struct vhost_umem *vhost_dev_reset_owner_prepare(void)
>  {
> -	return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
> +	return vhost_kvzalloc(sizeof(struct vhost_umem));
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
>  
>  /* Caller should have device mutex */
> -void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
> +void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
>  {
>  	int i;
>  
>  	vhost_dev_cleanup(dev, true);
>  
>  	/* Restore memory to default empty mapping. */
> -	memory->nregions = 0;
> -	dev->memory = memory;
> +	INIT_LIST_HEAD(&umem->umem_list);
> +	dev->umem = umem;
>  	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
>  	 * VQs aren't running.
>  	 */
>  	for (i = 0; i < dev->nvqs; ++i)
> -		dev->vqs[i]->memory = memory;
> +		dev->vqs[i]->umem = umem;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
>  
> @@ -523,6 +537,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +static void vhost_umem_clean(struct vhost_umem *umem)
> +{
> +	struct vhost_umem_node *node, *tmp;
> +
> +	if (!umem)
> +		return;
> +
> +	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
> +		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> +		list_del(&node->link);
> +		kvfree(node);
> +	}
> +	kvfree(umem);
> +}
> +
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -549,8 +578,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  		fput(dev->log_file);
>  	dev->log_file = NULL;
>  	/* No one will access memory at this point */
> -	kvfree(dev->memory);
> -	dev->memory = NULL;
> +	vhost_umem_clean(dev->umem);
> +	dev->umem = NULL;
>  	WARN_ON(!list_empty(&dev->work_list));
>  	if (dev->worker) {
>  		kthread_stop(dev->worker);
> @@ -576,25 +605,25 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
>  }
>  
>  /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
> +static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
>  			       int log_all)
>  {
> -	int i;
> +	struct vhost_umem_node *node;
>  
> -	if (!mem)
> +	if (!umem)
>  		return 0;
>  
> -	for (i = 0; i < mem->nregions; ++i) {
> -		struct vhost_memory_region *m = mem->regions + i;
> -		unsigned long a = m->userspace_addr;
> -		if (m->memory_size > ULONG_MAX)
> +	list_for_each_entry(node, &umem->umem_list, link) {
> +		unsigned long a = node->userspace_addr;
> +
> +		if (node->size > ULONG_MAX)
>  			return 0;
>  		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
> -				    m->memory_size))
> +				    node->size))
>  			return 0;
>  		else if (log_all && !log_access_ok(log_base,
> -						   m->guest_phys_addr,
> -						   m->memory_size))
> +						   node->start,
> +						   node->size))
>  			return 0;
>  	}
>  	return 1;
> @@ -602,7 +631,7 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
>  
>  /* Can we switch to this memory table? */
>  /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
> +static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
>  			    int log_all)
>  {
>  	int i;
> @@ -615,7 +644,8 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
>  		log = log_all || vhost_has_feature(d->vqs[i], VHOST_F_LOG_ALL);
>  		/* If ring is inactive, will check when it's enabled. */
>  		if (d->vqs[i]->private_data)
> -			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem, log);
> +			ok = vq_memory_access_ok(d->vqs[i]->log_base,
> +						 umem, log);
>  		else
>  			ok = 1;
>  		mutex_unlock(&d->vqs[i]->mutex);
> @@ -642,7 +672,7 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
>  /* Caller should have device mutex but not vq mutex */
>  int vhost_log_access_ok(struct vhost_dev *dev)
>  {
> -	return memory_access_ok(dev, dev->memory, 1);
> +	return memory_access_ok(dev, dev->umem, 1);
>  }
>  EXPORT_SYMBOL_GPL(vhost_log_access_ok);
>  
> @@ -653,7 +683,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>  {
>  	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>  
> -	return vq_memory_access_ok(log_base, vq->memory,
> +	return vq_memory_access_ok(log_base, vq->umem,
>  				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
>  		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
>  					sizeof *vq->used +
> @@ -669,28 +699,12 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>  
> -static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
> -{
> -	const struct vhost_memory_region *r1 = p1, *r2 = p2;
> -	if (r1->guest_phys_addr < r2->guest_phys_addr)
> -		return 1;
> -	if (r1->guest_phys_addr > r2->guest_phys_addr)
> -		return -1;
> -	return 0;
> -}
> -
> -static void *vhost_kvzalloc(unsigned long size)
> -{
> -	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -
> -	if (!n)
> -		n = vzalloc(size);
> -	return n;
> -}
> -
>  static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
> -	struct vhost_memory mem, *newmem, *oldmem;
> +	struct vhost_memory mem, *newmem;
> +	struct vhost_memory_region *region;
> +	struct vhost_umem_node *node;
> +	struct vhost_umem *newumem, *oldumem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
>  	int i;
>  
> @@ -710,24 +724,52 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  		kvfree(newmem);
>  		return -EFAULT;
>  	}
> -	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
> -		vhost_memory_reg_sort_cmp, NULL);
>  
> -	if (!memory_access_ok(d, newmem, 0)) {
> +	newumem = vhost_kvzalloc(sizeof(*newumem));
> +	if (!newumem) {
>  		kvfree(newmem);
> -		return -EFAULT;
> +		return -ENOMEM;
> +	}
> +
> +	newumem->umem_tree = RB_ROOT;
> +	INIT_LIST_HEAD(&newumem->umem_list);
> +
> +	for (region = newmem->regions;
> +	     region < newmem->regions + mem.nregions;
> +	     region++) {
> +		node = vhost_kvzalloc(sizeof(*node));
> +		if (!node)
> +			goto err;
> +		node->start = region->guest_phys_addr;
> +		node->size = region->memory_size;
> +		node->last = node->start + node->size - 1;
> +		node->userspace_addr = region->userspace_addr;
> +		INIT_LIST_HEAD(&node->link);
> +		list_add_tail(&node->link, &newumem->umem_list);
> +		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
>  	}
> -	oldmem = d->memory;
> -	d->memory = newmem;
> +
> +	if (!memory_access_ok(d, newumem, 0))
> +		goto err;
> +
> +	oldumem = d->umem;
> +	d->umem = newumem;
>  
>  	/* All memory accesses are done under some VQ mutex. */
>  	for (i = 0; i < d->nvqs; ++i) {
>  		mutex_lock(&d->vqs[i]->mutex);
> -		d->vqs[i]->memory = newmem;
> +		d->vqs[i]->umem = newumem;
>  		mutex_unlock(&d->vqs[i]->mutex);
>  	}
> -	kvfree(oldmem);
> +
> +	kvfree(newmem);
> +	vhost_umem_clean(oldumem);
>  	return 0;
> +
> +err:
> +	vhost_umem_clean(newumem);
> +	kvfree(newmem);
> +	return -EFAULT;
>  }
>  
>  long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
> @@ -1017,28 +1059,6 @@ done:
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
>  
> -static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> -						     __u64 addr, __u32 len)
> -{
> -	const struct vhost_memory_region *reg;
> -	int start = 0, end = mem->nregions;
> -
> -	while (start < end) {
> -		int slot = start + (end - start) / 2;
> -		reg = mem->regions + slot;
> -		if (addr >= reg->guest_phys_addr)
> -			end = slot;
> -		else
> -			start = slot + 1;
> -	}
> -
> -	reg = mem->regions + start;
> -	if (addr >= reg->guest_phys_addr &&
> -		reg->guest_phys_addr + reg->memory_size > addr)
> -		return reg;
> -	return NULL;
> -}
> -
>  /* TODO: This is really inefficient.  We need something like get_user()
>   * (instruction directly accesses the data, with an exception table entry
>   * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> @@ -1180,29 +1200,29 @@ EXPORT_SYMBOL_GPL(vhost_init_used);
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  			  struct iovec iov[], int iov_size)
>  {
> -	const struct vhost_memory_region *reg;
> -	struct vhost_memory *mem;
> +	const struct vhost_umem_node *node;
> +	struct vhost_umem *umem = vq->umem;
>  	struct iovec *_iov;
>  	u64 s = 0;
>  	int ret = 0;
>  
> -	mem = vq->memory;
>  	while ((u64)len > s) {
>  		u64 size;
>  		if (unlikely(ret >= iov_size)) {
>  			ret = -ENOBUFS;
>  			break;
>  		}
> -		reg = find_region(mem, addr, len);
> -		if (unlikely(!reg)) {
> +		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> +							addr, addr + len - 1);
> +		if (node == NULL || node->start > addr) {
>  			ret = -EFAULT;
>  			break;
>  		}
>  		_iov = iov + ret;
> -		size = reg->memory_size - addr + reg->guest_phys_addr;
> +		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
>  		_iov->iov_base = (void __user *)(unsigned long)
> -			(reg->userspace_addr + addr - reg->guest_phys_addr);
> +			(node->userspace_addr + addr - node->start);
>  		s += size;
>  		addr += size;
>  		++ret;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d3f7674..5d64393 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -52,6 +52,25 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +#define START(node) ((node)->start)
> +#define LAST(node) ((node)->last)
> +
> +struct vhost_umem_node {
> +	struct rb_node rb;
> +	struct list_head link;
> +	__u64 start;
> +	__u64 last;
> +	__u64 size;
> +	__u64 userspace_addr;
> +	__u64 flags_padding;
> +	__u64 __subtree_last;
> +};
> +
> +struct vhost_umem {
> +	struct rb_root umem_tree;
> +	struct list_head umem_list;
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -100,7 +119,7 @@ struct vhost_virtqueue {
>  	struct iovec *indirect;
>  	struct vring_used_elem *heads;
>  	/* Protected by virtqueue mutex. */
> -	struct vhost_memory *memory;
> +	struct vhost_umem *umem;
>  	void *private_data;
>  	u64 acked_features;
>  	/* Log write descriptors */
> @@ -117,7 +136,6 @@ struct vhost_virtqueue {
>  };
>  
>  struct vhost_dev {
> -	struct vhost_memory *memory;
>  	struct mm_struct *mm;
>  	struct mutex mutex;
>  	struct vhost_virtqueue **vqs;
> @@ -127,14 +145,15 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct vhost_umem *umem;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
>  long vhost_dev_set_owner(struct vhost_dev *dev);
>  bool vhost_dev_has_owner(struct vhost_dev *dev);
>  long vhost_dev_check_owner(struct vhost_dev *);
> -struct vhost_memory *vhost_dev_reset_owner_prepare(void);
> -void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_memory *);
> +struct vhost_umem *vhost_dev_reset_owner_prepare(void);
> +void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
>  void vhost_dev_cleanup(struct vhost_dev *, bool locked);
>  void vhost_dev_stop(struct vhost_dev *);
>  long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
> -- 
> 2.5.0

^ permalink raw reply

* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Michael S. Tsirkin @ 2016-04-27 11:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
	pbonzini
In-Reply-To: <1458873274-13961-3-git-send-email-jasowang@redhat.com>

On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
> This patch tries to implement an device IOTLB for vhost. This could be
> used with for co-operation with userspace(qemu) implementation of DMA
> remapping.
> 
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
> 
> - Fill the translation request in a preset userspace address (This
>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
> - Notify userspace through eventfd (This eventfd was set through ioctl
>   VHOST_SET_IOTLB_FD).

Why use an eventfd for this? We use them for interrupts because
that happens to be what kvm wants, but here - why don't we
just add a generic support for reading out events
on the vhost fd itself?

> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
> 
> When userspace finishes the translation, it will update the vhost
> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
> snooping the IOTLB invalidation of IOMMU IOTLB and use
> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.

There's one problem here, and that is that VQs still do not undergo
translation.  In theory VQ could be mapped in such a way
that it's not contigious in userspace memory.


> Signed-off-by: Jason Wang <jasowang@redhat.com>

What limits amount of entries that kernel keeps around?
Do we want at least a mod parameter for this?


> ---
>  drivers/vhost/net.c        |   6 +-
>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h      |  17 ++-
>  fs/eventfd.c               |   3 +-
>  include/uapi/linux/vhost.h |  35 ++++++
>  5 files changed, 320 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 481db96..7cbdeed 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -334,7 +334,7 @@ static void handle_tx(struct vhost_net *net)
>  		head = vhost_get_vq_desc(vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> -					 NULL, NULL);
> +					 NULL, NULL, VHOST_ACCESS_RO);
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(head < 0))
>  			break;
> @@ -470,7 +470,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  		}
>  		r = vhost_get_vq_desc(vq, vq->iov + seg,
>  				      ARRAY_SIZE(vq->iov) - seg, &out,
> -				      &in, log, log_num);
> +				      &in, log, log_num, VHOST_ACCESS_WO);
>  		if (unlikely(r < 0))
>  			goto err;
>  
> @@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>  		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
>  		if (r == -ENOIOCTLCMD)
>  			r = vhost_vring_ioctl(&n->dev, ioctl, argp);
> -		else
> +		else if (ioctl != VHOST_UPDATE_IOTLB)
>  			vhost_net_flush(n);
>  		mutex_unlock(&n->dev.mutex);
>  		return r;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 32c35a9..1dd43e8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -280,6 +280,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->iotlb_call = NULL;
> +	vq->iotlb_call_ctx = NULL;
> +	vq->iotlb_request = NULL;
> +	vq->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
>  	vq->umem = NULL;
>  	vq->is_le = virtio_legacy_is_little_endian();
>  	vhost_vq_reset_user_be(vq);
> @@ -387,8 +391,10 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->log_ctx = NULL;
>  	dev->log_file = NULL;
>  	dev->umem = NULL;
> +	dev->iotlb = NULL;
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
> +	spin_lock_init(&dev->iotlb_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
>  	dev->worker = NULL;
>  
> @@ -537,6 +543,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +static void vhost_umem_free(struct vhost_umem *umem,
> +			    struct vhost_umem_node *node)
> +{
> +	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> +	list_del(&node->link);
> +	kfree(node);
> +	umem->numem--;
> +}
> +
>  static void vhost_umem_clean(struct vhost_umem *umem)
>  {
>  	struct vhost_umem_node *node, *tmp;
> @@ -544,11 +559,9 @@ static void vhost_umem_clean(struct vhost_umem *umem)
>  	if (!umem)
>  		return;
>  
> -	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
> -		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> -		list_del(&node->link);
> -		kvfree(node);
> -	}
> +	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
> +		vhost_umem_free(umem, node);
> +
>  	kvfree(umem);
>  }
>  
> @@ -580,6 +593,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  	/* No one will access memory at this point */
>  	vhost_umem_clean(dev->umem);
>  	dev->umem = NULL;
> +	vhost_umem_clean(dev->iotlb);
> +	dev->iotlb = NULL;
>  	WARN_ON(!list_empty(&dev->work_list));
>  	if (dev->worker) {
>  		kthread_stop(dev->worker);
> @@ -699,11 +714,61 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>  
> +static int vhost_new_umem_range(struct vhost_umem *umem,
> +				u64 start, u64 size, u64 end,
> +				u64 userspace_addr, int perm)
> +{
> +	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
> +
> +	if (!node)
> +		return -ENOMEM;
> +
> +	if (umem->numem == VHOST_IOTLB_SIZE) {
> +		tmp = list_last_entry(&umem->umem_list, typeof(*tmp), link);
> +		vhost_umem_free(umem, tmp);
> +	}
> +
> +	node->start = start;
> +	node->size = size;
> +	node->last = end;
> +	node->userspace_addr = userspace_addr;
> +	node->perm = perm;
> +	INIT_LIST_HEAD(&node->link);
> +	list_add_tail(&node->link, &umem->umem_list);
> +	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
> +	umem->numem++;
> +
> +	return 0;
> +}
> +
> +static void vhost_del_umem_range(struct vhost_umem *umem,
> +				 u64 start, u64 end)
> +{
> +	struct vhost_umem_node *node;
> +
> +	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> +							   start, end)))
> +		vhost_umem_free(umem, node);
> +}
> +
> +static struct vhost_umem *vhost_umem_alloc(void)
> +{
> +	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
> +
> +	if (!umem)
> +		return NULL;
> +
> +	umem->umem_tree = RB_ROOT;
> +	umem->numem = 0;
> +	INIT_LIST_HEAD(&umem->umem_list);
> +
> +	return umem;
> +}
> +
>  static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
>  	struct vhost_memory mem, *newmem;
>  	struct vhost_memory_region *region;
> -	struct vhost_umem_node *node;
>  	struct vhost_umem *newumem, *oldumem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
>  	int i;
> @@ -725,28 +790,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  		return -EFAULT;
>  	}
>  
> -	newumem = vhost_kvzalloc(sizeof(*newumem));
> +	newumem = vhost_umem_alloc();
>  	if (!newumem) {
>  		kvfree(newmem);
>  		return -ENOMEM;
>  	}
>  
> -	newumem->umem_tree = RB_ROOT;
> -	INIT_LIST_HEAD(&newumem->umem_list);
> -
>  	for (region = newmem->regions;
>  	     region < newmem->regions + mem.nregions;
>  	     region++) {
> -		node = vhost_kvzalloc(sizeof(*node));
> -		if (!node)
> +		if (vhost_new_umem_range(newumem,
> +					 region->guest_phys_addr,
> +					 region->memory_size,
> +					 region->guest_phys_addr +
> +					 region->memory_size - 1,
> +					 region->userspace_addr,
> +				         VHOST_ACCESS_RW))
>  			goto err;
> -		node->start = region->guest_phys_addr;
> -		node->size = region->memory_size;
> -		node->last = node->start + node->size - 1;
> -		node->userspace_addr = region->userspace_addr;
> -		INIT_LIST_HEAD(&node->link);
> -		list_add_tail(&node->link, &newumem->umem_list);
> -		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
>  	}
>  
>  	if (!memory_access_ok(d, newumem, 0))
> @@ -782,6 +842,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  	struct vhost_vring_state s;
>  	struct vhost_vring_file f;
>  	struct vhost_vring_addr a;
> +	struct vhost_vring_iotlb_entry e;
>  	u32 idx;
>  	long r;
>  
> @@ -910,6 +971,35 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  		} else
>  			filep = eventfp;
>  		break;
> +	case VHOST_SET_VRING_IOTLB_REQUEST:
> +		r = -EFAULT;
> +		if (copy_from_user(&e, argp, sizeof e))
> +			break;
> +		if (!access_ok(VERIFY_WRITE, e.userspace_addr,
> +				sizeof(*vq->iotlb_request)))
> +			break;
> +		r = 0;
> +		vq->iotlb_request = (struct vhost_iotlb_entry __user *)e.userspace_addr;
> +		break;
> +	case VHOST_SET_VRING_IOTLB_CALL:
> +		if (copy_from_user(&f, argp, sizeof f)) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->iotlb_call) {
> +			filep = vq->iotlb_call;
> +			ctx = vq->iotlb_call_ctx;
> +			vq->iotlb_call = eventfp;
> +			vq->iotlb_call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
>  	case VHOST_SET_VRING_CALL:
>  		if (copy_from_user(&f, argp, sizeof f)) {
>  			r = -EFAULT;
> @@ -977,11 +1067,55 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
>  
> +static int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
> +{
> +	struct vhost_umem *niotlb, *oiotlb;
> +
> +	if (enabled) {
> +		niotlb = vhost_umem_alloc();
> +		if (!niotlb)
> +			return -ENOMEM;
> +	} else
> +		niotlb = NULL;
> +
> +	spin_lock(&d->iotlb_lock);
> +	oiotlb = d->iotlb;
> +	d->iotlb = niotlb;
> +	spin_unlock(&d->iotlb_lock);
> +
> +	vhost_umem_clean(oiotlb);
> +
> +	return 0;
> +}
> +
> +static void vhost_complete_iotlb_update(struct vhost_dev *d,
> +					struct vhost_iotlb_entry *entry)
> +{
> +	struct vhost_iotlb_entry *req;
> +	struct vhost_virtqueue *vq;
> +	int i;
> +
> +
> +	for (i = 0; i < d->nvqs; i++) {
> +		vq = d->vqs[i];
> +		mutex_lock(&vq->mutex);
> +		req = &vq->pending_request;
> +		if (entry->iova <= req->iova &&
> +		    entry->iova + entry->size - 1 > req->iova &&
> +		    req->flags.type == VHOST_IOTLB_MISS) {
> +			*req = *entry;
> +			vhost_poll_queue(&vq->poll);
> +		}
> +		mutex_unlock(&vq->mutex);
> +	}
> +}
> +
>  /* Caller must have device mutex */
>  long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  {
>  	struct file *eventfp, *filep = NULL;
>  	struct eventfd_ctx *ctx = NULL;
> +	struct vhost_iotlb_entry entry;
>  	u64 p;
>  	long r;
>  	int i, fd;
> @@ -1050,6 +1184,52 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  		if (filep)
>  			fput(filep);
>  		break;
> +	case VHOST_RUN_IOTLB:
> +		/* FIXME: enable and disabled */
> +		vhost_init_device_iotlb(d, true);
> +		break;
> +	case VHOST_UPDATE_IOTLB:
> +		r = copy_from_user(&entry, argp, sizeof(entry));
> +		if (r < 0) {
> +			r = -EFAULT;
> +			goto done;
> +		}
> +
> +		spin_lock(&d->iotlb_lock);
> +		if (!d->iotlb) {
> +			spin_unlock(&d->iotlb_lock);
> +			r = -EFAULT;
> +			goto done;
> +		}
> +		switch (entry.flags.type) {
> +		case VHOST_IOTLB_UPDATE:
> +			if (entry.flags.valid != VHOST_IOTLB_VALID) {
> +				break;
> +			}
> +			if (vhost_new_umem_range(d->iotlb,
> +						 entry.iova,
> +						 entry.size,
> +						 entry.iova + entry.size - 1,
> +                                                 entry.userspace_addr,
> +                                                 entry.flags.perm)) {
> +				r = -ENOMEM;
> +				break;
> +			}
> +			break;
> +		case VHOST_IOTLB_INVALIDATE:
> +			vhost_del_umem_range(d->iotlb,
> +					     entry.iova,
> +					     entry.iova + entry.size - 1);
> +			break;
> +		default:
> +			r = -EINVAL;
> +		}
> +		spin_unlock(&d->iotlb_lock);
> +
> +		if (!r && entry.flags.type != VHOST_IOTLB_INVALIDATE)
> +			vhost_complete_iotlb_update(d, &entry);
> +
> +		break;
>  	default:
>  		r = -ENOIOCTLCMD;
>  		break;
> @@ -1197,27 +1377,69 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_init_used);
>  
> +static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova)
> +{
> +	struct vhost_iotlb_entry *pending = &vq->pending_request;
> +	int ret;
> +
> +	if (!vq->iotlb_call_ctx)
> +		return -EOPNOTSUPP;
> +
> +	if (!vq->iotlb_request)
> +		return -EOPNOTSUPP;
> +
> +	if (pending->flags.type == VHOST_IOTLB_MISS) {
> +		return -EEXIST;
> +	}
> +
> +	pending->iova = iova;
> +	pending->flags.type = VHOST_IOTLB_MISS;
> +
> +	ret = __copy_to_user(vq->iotlb_request, pending,
> +			     sizeof(struct vhost_iotlb_entry));
> +	if (ret) {
> +		goto err;
> +	}
> +
> +	if (vq->iotlb_call_ctx)
> +		eventfd_signal(vq->iotlb_call_ctx, 1);
> +err:
> +	return ret;
> +}
> +
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> -			  struct iovec iov[], int iov_size)
> +			  struct iovec iov[], int iov_size, int access)
>  {
>  	const struct vhost_umem_node *node;
> -	struct vhost_umem *umem = vq->umem;
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
>  	struct iovec *_iov;
>  	u64 s = 0;
>  	int ret = 0;
>  
> +	spin_lock(&dev->iotlb_lock);
> +
>  	while ((u64)len > s) {
>  		u64 size;
>  		if (unlikely(ret >= iov_size)) {
>  			ret = -ENOBUFS;
>  			break;
>  		}
> +
>  		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
>  							addr, addr + len - 1);
>  		if (node == NULL || node->start > addr) {
> -			ret = -EFAULT;
> +			if (umem != dev->iotlb) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +			ret = -EAGAIN;
> +			break;
> +		} else if (!(node->perm & access)) {
> +			ret = -EPERM;
>  			break;
>  		}
> +
>  		_iov = iov + ret;
>  		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
> @@ -1228,6 +1450,10 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  		++ret;
>  	}
>  
> +	spin_unlock(&dev->iotlb_lock);
> +
> +	if (ret == -EAGAIN)
> +		vhost_iotlb_miss(vq, addr);
>  	return ret;
>  }
>  
> @@ -1256,7 +1482,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  			struct iovec iov[], unsigned int iov_size,
>  			unsigned int *out_num, unsigned int *in_num,
>  			struct vhost_log *log, unsigned int *log_num,
> -			struct vring_desc *indirect)
> +			struct vring_desc *indirect, int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i = 0, count, found = 0;
> @@ -1274,9 +1500,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  	}
>  
>  	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
> -			     UIO_MAXIOV);
> +			     UIO_MAXIOV, access);
>  	if (unlikely(ret < 0)) {
> -		vq_err(vq, "Translation failure %d in indirect.\n", ret);
> +		if (ret != -EAGAIN)
> +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
>  		return ret;
>  	}
>  	iov_iter_init(&from, READ, vq->indirect, ret, len);
> @@ -1316,10 +1543,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d indirect idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d indirect idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		/* If this is an input descriptor, increment that count. */
> @@ -1355,7 +1583,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		      struct iovec iov[], unsigned int iov_size,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num)
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i, head, found = 0;
> @@ -1433,10 +1662,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
>  			ret = get_indirect(vq, iov, iov_size,
>  					   out_num, in_num,
> -					   log, log_num, &desc);
> +					   log, log_num, &desc, access);
>  			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> +				if (ret != -EAGAIN)
> +					vq_err(vq, "Failure detected "
> +						"in indirect descriptor at idx %d\n", i);
>  				return ret;
>  			}
>  			continue;
> @@ -1444,10 +1674,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 5d64393..4365104 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -62,13 +62,15 @@ struct vhost_umem_node {
>  	__u64 last;
>  	__u64 size;
>  	__u64 userspace_addr;
> -	__u64 flags_padding;
> +	__u32 perm;
> +	__u32 flags_padding;
>  	__u64 __subtree_last;
>  };
>  
>  struct vhost_umem {
>  	struct rb_root umem_tree;
>  	struct list_head umem_list;
> +	int numem;
>  };
>  
>  /* The virtqueue structure describes a queue attached to a device. */
> @@ -84,9 +86,13 @@ struct vhost_virtqueue {
>  	struct file *kick;
>  	struct file *call;
>  	struct file *error;
> +	struct file *iotlb_call;
>  	struct eventfd_ctx *call_ctx;
>  	struct eventfd_ctx *error_ctx;
>  	struct eventfd_ctx *log_ctx;
> +	struct eventfd_ctx *iotlb_call_ctx;
> +	struct vhost_iotlb_entry __user *iotlb_request;
> +	struct vhost_iotlb_entry pending_request;
>  
>  	struct vhost_poll poll;
>  
> @@ -135,6 +141,8 @@ struct vhost_virtqueue {
>  #endif
>  };
>  
> +#define VHOST_IOTLB_SIZE 2048
> +
>  struct vhost_dev {
>  	struct mm_struct *mm;
>  	struct mutex mutex;
> @@ -146,6 +154,8 @@ struct vhost_dev {
>  	struct list_head work_list;
>  	struct task_struct *worker;
>  	struct vhost_umem *umem;
> +	struct vhost_umem *iotlb;
> +	spinlock_t iotlb_lock;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> @@ -164,7 +174,8 @@ int vhost_log_access_ok(struct vhost_dev *);
>  int vhost_get_vq_desc(struct vhost_virtqueue *,
>  		      struct iovec iov[], unsigned int iov_count,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num);
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access);
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_init_used(struct vhost_virtqueue *);
> @@ -183,7 +194,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
> -		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		printk(pr_fmt(fmt), ##__VA_ARGS__);       \
>  		if ((vq)->error_ctx)                               \
>  				eventfd_signal((vq)->error_ctx, 1);\
>  	} while (0)
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..5c0a22f 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -59,8 +59,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	if (ULLONG_MAX - ctx->count < n)
>  		n = ULLONG_MAX - ctx->count;
>  	ctx->count += n;
> -	if (waitqueue_active(&ctx->wqh))
> +	if (waitqueue_active(&ctx->wqh)) {
>  		wake_up_locked_poll(&ctx->wqh, POLLIN);
> +	}
>  	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
>  
>  	return n;
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index ab373191..5c35ab4 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -47,6 +47,32 @@ struct vhost_vring_addr {
>  	__u64 log_guest_addr;
>  };
>  
> +struct vhost_iotlb_entry {
> +	__u64 iova;
> +	__u64 size;
> +	__u64 userspace_addr;

Alignment requirements?

> +	struct {
> +#define VHOST_ACCESS_RO      0x1
> +#define VHOST_ACCESS_WO      0x2
> +#define VHOST_ACCESS_RW      0x3
> +		__u8  perm;
> +#define VHOST_IOTLB_MISS           1
> +#define VHOST_IOTLB_UPDATE         2
> +#define VHOST_IOTLB_INVALIDATE     3
> +		__u8  type;
> +#define VHOST_IOTLB_INVALID        0x1
> +#define VHOST_IOTLB_VALID          0x2
> +		__u8  valid;

why do we need this flag?

> +		__u8  u8_padding;
> +		__u32 padding;
> +	} flags;
> +};
> +
> +struct vhost_vring_iotlb_entry {
> +	unsigned int index;
> +	__u64 userspace_addr;
> +};
> +
>  struct vhost_memory_region {
>  	__u64 guest_phys_addr;
>  	__u64 memory_size; /* bytes */
> @@ -127,6 +153,15 @@ struct vhost_memory {
>  /* Set eventfd to signal an error */
>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>  
> +/* IOTLB */
> +/* Specify an eventfd file descriptor to signle on IOTLB miss */

typo

> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
> +                                        vhost_vring_file)
> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
> +                                           vhost_vring_iotlb_entry)
> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
> +

Is the assumption that userspace must dedicate a thread to running the iotlb? 
I dislike this one.
Please support asynchronous APIs at least optionally to make
userspace make its own threading decisions.

>  /* VHOST_NET specific defines */
>  
>  /* Attach virtio net ring to a raw socket, or tap device.

Don't we need a feature bit for this?
Are we short on feature bits? If yes maybe it's time to add
something like PROTOCOL_FEATURES that we have in vhost-user.

> -- 
> 2.5.0

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-27 12:18 UTC (permalink / raw)
  To: Wei Liu, Michael S. Tsirkin
  Cc: Andy Lutomirski, qemu-block, Christian Borntraeger,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel, Amit Shah,
	iommu, Stefan Hajnoczi, kvm, pbonzini, virtualization,
	Anthony PERARD
In-Reply-To: <20160421135416.GE11775@citrix.com>


[-- Attachment #1.1: Type: text/plain, Size: 3339 bytes --]


> > On some systems, including Xen and any system with a physical device
> > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > for virtio DMA to work at all.
> > 
> > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > 
> > If not there, we preserve historic behavior and bypass the DMA
> > API unless within Xen guest. This is actually required for
> > systems, including SPARC and PPC64, where virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> > 
> > Re: non-virtio devices.
> > 
> > It turns out that on old QEMU hosts, only emulated devices which were
> > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > devices *only*, it would be rather easy to detect them by looking at
> > subsystem vendor and device ID. Thus, no new interfaces are required
> > except for virtio which always uses the same subsystem vendor and device ID.

Apologies for dropping this thread; I've been travelling.

But seriously, NO!

I understand why you want to see this as a virtio-specific issue, but
it isn't. And we don't *want* it to be.

In the guest, drivers SHALL use the DMA API. And the DMA API SHALL do
the right thing for each device according to its needs.

So any information passed from qemu to the guest should be directed at
the platform IOMMU code (or handled by qemu-detection quirks in the
guest, if we must).

It is *not* acceptable for the virtio drivers in the guest to just
eschew the DMA API completely, triggered by some device-specific flag.

The qemu implementation is, of course, monolithic. In qemu the fact
that virtio doesn't get translated by the emulated IOMMU *is* actually
down to code in the virtio implementation. I get that.

But then again, it's not just virtio. *Any* device which we emulate for
the guest could have that same issue, and appear as untranslated. (And
assigned PCI devices currently do).

Let's think about the parallel with a system-on-chip. Let's say we have
a peripheral which got included, but which was wired up such that it
bypasses the IOMMU and gets to do direct physical DMA. Is that a
feature of that specific peripheral? Do we hack its drivers to make the
distinction between this incarnation, and a normal discrete version of
the same device? No! It's a feature of the *system* and needs to be
conveyed to the OS IOMMU code to do the right thing. Not to the driver.

In my opinion, adding the VIRTIO_F_IOMMU_PLATFORM feature bit is
absolutely the wrong thing to do.

What we *should* do is a patchset in the guest which both fixes virtio
drivers to *always* use the DMA API, and fixes the DMA API to DTRT at
the same time — by detecting qemu and installing no-op DMA ops for the
appropriate devices, perhaps.

Then we can look at giving qemu a way to properly indicate which
devices it actually does DMA mapping for, so we can remove those
heuristic assumptions.

But that flag does *not* live in the virtio host←→guest ABI.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 13:37 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel, Amit Shah,
	iommu, Stefan Hajnoczi, kvm, pbonzini, virtualization,
	Anthony PERARD
In-Reply-To: <1461759501.118304.149.camel@infradead.org>

On Wed, Apr 27, 2016 at 01:18:21PM +0100, David Woodhouse wrote:
> 
> > > On some systems, including Xen and any system with a physical device
> > > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > > for virtio DMA to work at all.
> > > 
> > > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > > 
> > > If not there, we preserve historic behavior and bypass the DMA
> > > API unless within Xen guest. This is actually required for
> > > systems, including SPARC and PPC64, where virtio-pci devices are
> > > enumerated as though they are behind an IOMMU, but the virtio host
> > > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > > there or somehow map everything as the identity.
> > > 
> > > Re: non-virtio devices.
> > > 
> > > It turns out that on old QEMU hosts, only emulated devices which were
> > > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > > devices *only*, it would be rather easy to detect them by looking at
> > > subsystem vendor and device ID. Thus, no new interfaces are required
> > > except for virtio which always uses the same subsystem vendor and device ID.
> 
> Apologies for dropping this thread; I've been travelling.
> 
> But seriously, NO!
> 
> I understand why you want to see this as a virtio-specific issue, but
> it isn't. And we don't *want* it to be.
> 
> In the guest, drivers SHALL use the DMA API. And the DMA API SHALL do
> the right thing for each device according to its needs.
> 
> So any information passed from qemu to the guest should be directed at
> the platform IOMMU code (or handled by qemu-detection quirks in the
> guest, if we must).
> 
> It is *not* acceptable for the virtio drivers in the guest to just
> eschew the DMA API completely, triggered by some device-specific flag.
> 
> The qemu implementation is, of course, monolithic. In qemu the fact
> that virtio doesn't get translated by the emulated IOMMU *is* actually
> down to code in the virtio implementation. I get that.
> 
> But then again, it's not just virtio. *Any* device which we emulate for
> the guest could have that same issue, and appear as untranslated. (And
> assigned PCI devices currently do).
> 
> Let's think about the parallel with a system-on-chip. Let's say we have
> a peripheral which got included, but which was wired up such that it
> bypasses the IOMMU and gets to do direct physical DMA. Is that a
> feature of that specific peripheral? Do we hack its drivers to make the
> distinction between this incarnation, and a normal discrete version of
> the same device? No! It's a feature of the *system*

One correction: it's a feature of the device in the system.
There could be a mix of devices bypassing and not
bypassing the IOMMU.

> and needs to be
> conveyed to the OS IOMMU code to do the right thing. Not to the driver.
> 
> In my opinion, adding the VIRTIO_F_IOMMU_PLATFORM feature bit is
> absolutely the wrong thing to do.
> 
> What we *should* do is a patchset in the guest which both fixes virtio
> drivers to *always* use the DMA API, and fixes the DMA API to DTRT at
> the same time — by detecting qemu and installing no-op DMA ops for the
> appropriate devices, perhaps.

Sounds good. And a way to detect appropriate devices could
be by looking at the feature flag, perhaps?


> Then we can look at giving qemu a way to properly indicate which
> devices it actually does DMA mapping for, so we can remove those
> heuristic assumptions.
> 
> But that flag does *not* live in the virtio host←→guest ABI.
> 
> -- 
> David Woodhouse                            Open Source Technology Centre
> David.Woodhouse@intel.com                              Intel Corporation
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Joerg Roedel @ 2016-04-27 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, David Woodhouse
In-Reply-To: <20160427153345-mutt-send-email-mst@redhat.com>

On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> One correction: it's a feature of the device in the system.
> There could be a mix of devices bypassing and not
> bypassing the IOMMU.

No, it really is not. A device can't chose to bypass the IOMMU. But the
IOMMU can chose to let the device bypass. So any fix here belongs
into the platform/iommu code too and not into some driver.

> Sounds good. And a way to detect appropriate devices could
> be by looking at the feature flag, perhaps?

Again, no! The way to detect that is to look into the iommu description
structures provided by the firmware. They provide everything necessary
to tell the iommu code which devices are not translated.



	Joerg

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-27 14:31 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Michael S. Tsirkin, Stefano Stabellini,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm list, Amit Shah, Paolo Bonzini,
	Linux Virtualization, David Woodhouse
In-Reply-To: <20160427142331.GH17926@8bytes.org>

On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> One correction: it's a feature of the device in the system.
>> There could be a mix of devices bypassing and not
>> bypassing the IOMMU.
>
> No, it really is not. A device can't chose to bypass the IOMMU. But the
> IOMMU can chose to let the device bypass. So any fix here belongs
> into the platform/iommu code too and not into some driver.
>
>> Sounds good. And a way to detect appropriate devices could
>> be by looking at the feature flag, perhaps?
>
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
>

Except on PPC and SPARC.  As far as I know, those are the only
problematic platforms.

Is it too late to *disable* QEMU's q35-iommu thingy until it can be
fixed to report correct data in the DMAR tables?

--Andy

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 14:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, David Woodhouse
In-Reply-To: <20160427142331.GH17926@8bytes.org>

On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> > One correction: it's a feature of the device in the system.
> > There could be a mix of devices bypassing and not
> > bypassing the IOMMU.
> 
> No, it really is not. A device can't chose to bypass the IOMMU. But the
> IOMMU can chose to let the device bypass.

QEMU can choose to bypass IOMMU for one device and not the other.
IOMMU in QEMU isn't involved when it's bypassed.

> So any fix here belongs
> into the platform/iommu code too and not into some driver.

Fine but this is beside the point. Almost all virtio devices
bypass IOMMU and what this patch does is create a way
to detect devices that don't. This code can maybe go into
platform.

> > Sounds good. And a way to detect appropriate devices could
> > be by looking at the feature flag, perhaps?
> 
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
> 
> 
> 
> 	Joerg

It would be easy if they did but they don't do this on all systems.
In particular the idea for firmware interface was clearly
that a given bus either is connected through IOMMU or bypassing it.
Whether virtio bypasses the iommu is unrelated to the bus it's on.

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 14:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, Joerg Roedel,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm list, Amit Shah, Paolo Bonzini,
	Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrVkSSJbjoK8i7pLsSYR0o=Wy1UP-mrmn2uxYUd81g18dg@mail.gmail.com>

On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> >> One correction: it's a feature of the device in the system.
> >> There could be a mix of devices bypassing and not
> >> bypassing the IOMMU.
> >
> > No, it really is not. A device can't chose to bypass the IOMMU. But the
> > IOMMU can chose to let the device bypass. So any fix here belongs
> > into the platform/iommu code too and not into some driver.
> >
> >> Sounds good. And a way to detect appropriate devices could
> >> be by looking at the feature flag, perhaps?
> >
> > Again, no! The way to detect that is to look into the iommu description
> > structures provided by the firmware. They provide everything necessary
> > to tell the iommu code which devices are not translated.
> >
> 
> Except on PPC and SPARC.  As far as I know, those are the only
> problematic platforms.
> 
> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
> fixed to report correct data in the DMAR tables?
> 
> --Andy

Meaning virtio or assigned devices?
For virtio - it's way too late since these are working configurations.
For assigned devices - they don't work on x86 so it doesn't have
to be disabled, it's safe to ignore.

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-27 14:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, Joerg Roedel,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm list, Amit Shah, Paolo Bonzini,
	Linux Virtualization, David Woodhouse
In-Reply-To: <20160427173603-mutt-send-email-mst@redhat.com>

On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> One correction: it's a feature of the device in the system.
>> >> There could be a mix of devices bypassing and not
>> >> bypassing the IOMMU.
>> >
>> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> > IOMMU can chose to let the device bypass. So any fix here belongs
>> > into the platform/iommu code too and not into some driver.
>> >
>> >> Sounds good. And a way to detect appropriate devices could
>> >> be by looking at the feature flag, perhaps?
>> >
>> > Again, no! The way to detect that is to look into the iommu description
>> > structures provided by the firmware. They provide everything necessary
>> > to tell the iommu code which devices are not translated.
>> >
>>
>> Except on PPC and SPARC.  As far as I know, those are the only
>> problematic platforms.
>>
>> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> fixed to report correct data in the DMAR tables?
>>
>> --Andy
>
> Meaning virtio or assigned devices?
> For virtio - it's way too late since these are working configurations.
> For assigned devices - they don't work on x86 so it doesn't have
> to be disabled, it's safe to ignore.

I mean actually prevent QEMU from running in q35-iommu mode with any
virtio devices attached or maybe even turn off q35-iommu mode entirely
[1].  Doesn't it require that the user literally pass the word
"experimental" into QEMU right now?  It did at some point IIRC.

The reason I'm asking is that, other than q35-iommu, QEMU's virtio
devices *don't* bypass the IOMMU except on PPC and SPARC, simply
because there is no other configuration AFAICT that has virtio and and
IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
correctly (thus breaking q35-iommu users with older guest kernels,
which hopefully don't actually exist) and to come up with a PPC- and
SPARC-specific solution, or maybe OpenFirmware-specific solution, to
handle PPC and SPARC down the road.

[1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
showed up in a release asking the QEMU team to please not do that
until this issue was resolved.  Sadly, that email was ignored :(

--Andy

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 14:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, Joerg Roedel,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm list, Amit Shah, Paolo Bonzini,
	Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrUnXSLApj--ypOHnLWLqVSHLpSa922F2C2OCrvA1B585A@mail.gmail.com>

On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> >> >> One correction: it's a feature of the device in the system.
> >> >> There could be a mix of devices bypassing and not
> >> >> bypassing the IOMMU.
> >> >
> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the
> >> > IOMMU can chose to let the device bypass. So any fix here belongs
> >> > into the platform/iommu code too and not into some driver.
> >> >
> >> >> Sounds good. And a way to detect appropriate devices could
> >> >> be by looking at the feature flag, perhaps?
> >> >
> >> > Again, no! The way to detect that is to look into the iommu description
> >> > structures provided by the firmware. They provide everything necessary
> >> > to tell the iommu code which devices are not translated.
> >> >
> >>
> >> Except on PPC and SPARC.  As far as I know, those are the only
> >> problematic platforms.
> >>
> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
> >> fixed to report correct data in the DMAR tables?
> >>
> >> --Andy
> >
> > Meaning virtio or assigned devices?
> > For virtio - it's way too late since these are working configurations.
> > For assigned devices - they don't work on x86 so it doesn't have
> > to be disabled, it's safe to ignore.
> 
> I mean actually prevent QEMU from running in q35-iommu mode with any
> virtio devices attached or maybe even turn off q35-iommu mode entirely
> [1].  Doesn't it require that the user literally pass the word
> "experimental" into QEMU right now?  It did at some point IIRC.
> 
> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
> because there is no other configuration AFAICT that has virtio and and
> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
> correctly (thus breaking q35-iommu users with older guest kernels,
> which hopefully don't actually exist) and to come up with a PPC- and
> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
> handle PPC and SPARC down the road.
> 
> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
> showed up in a release asking the QEMU team to please not do that
> until this issue was resolved.  Sadly, that email was ignored :(
> 
> --Andy

Sorry, I didn't make myself clear.
Point is, QEMU is not the only virtio implementation out there.
So we can't know no virtio implementations have an IOMMU as long as
linux supports this IOMMU.
virtio always used physical addresses since it was born and if it
changes that it must do this in a way that does not break existing
users.

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Joerg Roedel @ 2016-04-27 14:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, David Woodhouse
In-Reply-To: <20160427172630-mutt-send-email-mst@redhat.com>

On Wed, Apr 27, 2016 at 05:34:30PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:

> QEMU can choose to bypass IOMMU for one device and not the other.
> IOMMU in QEMU isn't involved when it's bypassed.

And it is QEMU's task to tell the OS, right? And the correct way to do
this is via the firmware ACPI tables.

> Fine but this is beside the point. Almost all virtio devices
> bypass IOMMU and what this patch does is create a way
> to detect devices that don't. This code can maybe go into
> platform.

Again, the way to detect this is in platform code must not be device
specific. This is what the DMAR and IVRS tables on x86 are for.

When there is no way to do this in the firmware (or there is no firmware
at all), we have to do a quirk in the platform code for it.



	Joerg

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Joerg Roedel @ 2016-04-27 14:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org Developers, peterx, Andy Lutomirski,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm list,
	Amit Shah, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160427175031-mutt-send-email-mst@redhat.com>

On Wed, Apr 27, 2016 at 05:54:57PM +0300, Michael S. Tsirkin wrote:
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

FWIW, virtio in qemu can continue to just use physical addresses. But
qemu needs to advertise that fact correctly to the OS in the DMAR table.
This way old kernels (where virtio does not use DMA-API) will also
continue to work on the fixed qemu.



	Joerg

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 15:05 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, David Woodhouse
In-Reply-To: <20160427145632.GI17926@8bytes.org>

On Wed, Apr 27, 2016 at 04:56:32PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 05:34:30PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:
> 
> > QEMU can choose to bypass IOMMU for one device and not the other.
> > IOMMU in QEMU isn't involved when it's bypassed.
> 
> And it is QEMU's task to tell the OS, right? And the correct way to do
> this is via the firmware ACPI tables.

Going forward, this might be reasonable. Of course it didn't in the past
and no one cared because virtio devices used physical addresses. We have
to keep these setups working.

> > Fine but this is beside the point. Almost all virtio devices
> > bypass IOMMU and what this patch does is create a way
> > to detect devices that don't. This code can maybe go into
> > platform.
> 
> Again, the way to detect this is in platform code must not be device
> specific. This is what the DMAR and IVRS tables on x86 are for.
> 
> When there is no way to do this in the firmware (or there is no firmware
> at all), we have to do a quirk in the platform code for it.
> 
> 
> 
> 	Joerg

I really don't get it.

There's exactly one device that works now and needs the work-around and
so that we need to support, and that is virtio. It happens to have
exactly the same issue on all platforms.

Why would we want to work hard to build platform-specific
solutions to a problem that can be solved in 5 lines of
generic code?

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 15:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org Developers, peterx, Andy Lutomirski,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm list,
	Amit Shah, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160427145851.GJ17926@8bytes.org>

On Wed, Apr 27, 2016 at 04:58:51PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 05:54:57PM +0300, Michael S. Tsirkin wrote:
> > Point is, QEMU is not the only virtio implementation out there.
> > So we can't know no virtio implementations have an IOMMU as long as
> > linux supports this IOMMU.
> > virtio always used physical addresses since it was born and if it
> > changes that it must do this in a way that does not break existing
> > users.
> 
> FWIW, virtio in qemu can continue to just use physical addresses. But
> qemu needs to advertise that fact correctly to the OS in the DMAR table.
> This way old kernels (where virtio does not use DMA-API) will also
> continue to work on the fixed qemu.
> 
> 
> 
> 	Joerg

It's not clear it can do this since DMAR tables seem to assume
a given slot is either bypassing IOMMU or going through it.
QEMU allows reusing same slot for virtio and non-virtio devices.

Besides, this patch is not about it, it's a flag for QEMU to
tell guest that it can trust DMAR tables.

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-27 15:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony PERARD, Wei Liu, Stefan Hajnoczi, qemu-block,
	Stefano Stabellini, Joerg Roedel,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm list, Amit Shah, Paolo Bonzini,
	Linux Virtualization, David Woodhouse
In-Reply-To: <20160427175031-mutt-send-email-mst@redhat.com>

On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> >> One correction: it's a feature of the device in the system.
>> >> >> There could be a mix of devices bypassing and not
>> >> >> bypassing the IOMMU.
>> >> >
>> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> >> > IOMMU can chose to let the device bypass. So any fix here belongs
>> >> > into the platform/iommu code too and not into some driver.
>> >> >
>> >> >> Sounds good. And a way to detect appropriate devices could
>> >> >> be by looking at the feature flag, perhaps?
>> >> >
>> >> > Again, no! The way to detect that is to look into the iommu description
>> >> > structures provided by the firmware. They provide everything necessary
>> >> > to tell the iommu code which devices are not translated.
>> >> >
>> >>
>> >> Except on PPC and SPARC.  As far as I know, those are the only
>> >> problematic platforms.
>> >>
>> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> >> fixed to report correct data in the DMAR tables?
>> >>
>> >> --Andy
>> >
>> > Meaning virtio or assigned devices?
>> > For virtio - it's way too late since these are working configurations.
>> > For assigned devices - they don't work on x86 so it doesn't have
>> > to be disabled, it's safe to ignore.
>>
>> I mean actually prevent QEMU from running in q35-iommu mode with any
>> virtio devices attached or maybe even turn off q35-iommu mode entirely
>> [1].  Doesn't it require that the user literally pass the word
>> "experimental" into QEMU right now?  It did at some point IIRC.
>>
>> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
>> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
>> because there is no other configuration AFAICT that has virtio and and
>> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
>> correctly (thus breaking q35-iommu users with older guest kernels,
>> which hopefully don't actually exist) and to come up with a PPC- and
>> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
>> handle PPC and SPARC down the road.
>>
>> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
>> showed up in a release asking the QEMU team to please not do that
>> until this issue was resolved.  Sadly, that email was ignored :(
>>
>> --Andy
>
> Sorry, I didn't make myself clear.
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

Is there any non-QEMU virtio implementation can provide an
IOMMU-bypassing virtio device on a platform that has a nontrivial
IOMMU?

--Andy

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-27 15:15 UTC (permalink / raw)
  To: Michael S. Tsirkin, Joerg Roedel
  Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
	qemu-devel, peterx, linux-kernel, Christian Borntraeger, iommu,
	Andy Lutomirski, kvm, Amit Shah, pbonzini, virtualization,
	Anthony PERARD
In-Reply-To: <20160427180007-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 937 bytes --]

On Wed, 2016-04-27 at 18:05 +0300, Michael S. Tsirkin wrote:
> 
> I really don't get it.
> 
> There's exactly one device that works now and needs the work-around and
> so that we need to support, and that is virtio. It happens to have
> exactly the same issue on all platforms.

False. We have other devices which are currently *not* translated by
the emulated IOMMU and which aren't going to be in the short term
either.

We also have other devices (emulated hardware NICs) to which precisely
the same "we don't need protection" arguments apply, and which we
*could* expose to the guest without an IOMMU translation if we really
wanted to. It makes as much sense as exposing virtio without an IOMMU,
going forward.

> Why would we want to work hard to build platform-specific
> solutions to a problem that can be solved in 5 lines of
> generic code?

Because it's a dirty hack in the *wrong* place.

-- 
dwmw2

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-27 18:17 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
	Joerg Roedel, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, Anthony PERARD
In-Reply-To: <1461770135.118304.152.camel@infradead.org>

On Wed, Apr 27, 2016 at 04:15:35PM +0100, David Woodhouse wrote:
> On Wed, 2016-04-27 at 18:05 +0300, Michael S. Tsirkin wrote:
> > 
> > I really don't get it.
> > 
> > There's exactly one device that works now and needs the work-around and
> > so that we need to support, and that is virtio. It happens to have
> > exactly the same issue on all platforms.
> 
> False. We have other devices which are currently *not* translated by
> the emulated IOMMU and which aren't going to be in the short term
> either.
> 
> We also have other devices (emulated hardware NICs) to which precisely
> the same "we don't need protection" arguments apply, and which we
> *could* expose to the guest without an IOMMU translation if we really
> wanted to. It makes as much sense as exposing virtio without an IOMMU,
> going forward.

The reasons for virtio are mostly dealing legacy.
We don't need protection is a separate issue
that I'd rather drop for now.

> > Why would we want to work hard to build platform-specific
> > solutions to a problem that can be solved in 5 lines of
> > generic code?
> 
> Because it's a dirty hack in the *wrong* place.

No one came up with a better one so far :(

> -- 
> dwmw2

^ permalink raw reply

* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-27 19:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
	Joerg Roedel, qemu-devel, peterx, linux-kernel,
	Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
	pbonzini, virtualization, Anthony PERARD
In-Reply-To: <20160427211635-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 1160 bytes --]

On Wed, 2016-04-27 at 21:17 +0300, Michael S. Tsirkin wrote:
> 
> > Because it's a dirty hack in the *wrong* place.
> 
> No one came up with a better one so far :(

Seriously?

Take a look at drivers/iommu/intel-iommu.c. It has quirks for all kinds
of shitty devices that have to be put in passthrough mode or otherwise
excluded.

We don't actually *need* it for the Intel IOMMU; all we need is for
QEMU to stop lying in its DMAR tables.

We *do* want the same kind of quirks in the relevant POWER and ARM
IOMMU code in the kernel. Do that (hell, a simple quirk for all virtio
devices will suffice, but NOT in the virtio driver) at the same moment
you fix the virtio devices to use the DMA API. Job done.

Some time *later* we can work on *refining* that quirk, and a way for
QEMU to tell the guest (via something generic like fwcfg, maybe) that
some devices are and aren't translated.

Actually, I'm about to look at moving dma_ops into struct device and
cleaning up the way we detect which IOMMU is attached, at device
instantiation time. Perhaps I can shove the virtio-exception quirk in
there while I'm at it...

-- 
dwmw2


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v4 00/13] Support non-lru page migration
From: Andrew Morton @ 2016-04-27 20:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, dri-devel,
	virtualization, John Einar Reitan, linux-mm, Chulmin Kim,
	Gioh Kim, Konstantin Khlebnikov, Sangseok Lee, Joonsoo Kim,
	Kyeongdon Kim, Naoya Horiguchi, Vlastimil Babka, Mel Gorman
In-Reply-To: <1461743305-19970-1-git-send-email-minchan@kernel.org>

On Wed, 27 Apr 2016 16:48:13 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Recently, I got many reports about perfermance degradation in embedded
> system(Android mobile phone, webOS TV and so on) and easy fork fail.
> 
> The problem was fragmentation caused by zram and GPU driver mainly.
> With memory pressure, their pages were spread out all of pageblock and
> it cannot be migrated with current compaction algorithm which supports
> only LRU pages. In the end, compaction cannot work well so reclaimer
> shrinks all of working set pages. It made system very slow and even to
> fail to fork easily which requires order-[2 or 3] allocations.
> 
> Other pain point is that they cannot use CMA memory space so when OOM
> kill happens, I can see many free pages in CMA area, which is not
> memory efficient. In our product which has big CMA memory, it reclaims
> zones too exccessively to allocate GPU and zram page although there are
> lots of free space in CMA so system becomes very slow easily.
> 
> To solve these problem, this patch tries to add facility to migrate
> non-lru pages via introducing new functions and page flags to help
> migration.

I'm seeing some rejects here against Mel's changes and our patch
bandwidth is getting waaay way ahead of our review bandwidth.  So I
think I'll loadshed this patchset at this time, sorry.

^ permalink raw reply

* Re: [PATCH v4 00/13] Support non-lru page migration
From: Minchan Kim @ 2016-04-27 23:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Jonathan Corbet,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, dri-devel,
	virtualization, John Einar Reitan, linux-mm, Chulmin Kim,
	Gioh Kim, Konstantin Khlebnikov, Sangseok Lee, Joonsoo Kim,
	Kyeongdon Kim, Naoya Horiguchi, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160427132035.e96f99f3420c8fb0020b0fc4@linux-foundation.org>

Hello Andrew,

On Wed, Apr 27, 2016 at 01:20:35PM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2016 16:48:13 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Recently, I got many reports about perfermance degradation in embedded
> > system(Android mobile phone, webOS TV and so on) and easy fork fail.
> > 
> > The problem was fragmentation caused by zram and GPU driver mainly.
> > With memory pressure, their pages were spread out all of pageblock and
> > it cannot be migrated with current compaction algorithm which supports
> > only LRU pages. In the end, compaction cannot work well so reclaimer
> > shrinks all of working set pages. It made system very slow and even to
> > fail to fork easily which requires order-[2 or 3] allocations.
> > 
> > Other pain point is that they cannot use CMA memory space so when OOM
> > kill happens, I can see many free pages in CMA area, which is not
> > memory efficient. In our product which has big CMA memory, it reclaims
> > zones too exccessively to allocate GPU and zram page although there are
> > lots of free space in CMA so system becomes very slow easily.
> > 
> > To solve these problem, this patch tries to add facility to migrate
> > non-lru pages via introducing new functions and page flags to help
> > migration.
> 
> I'm seeing some rejects here against Mel's changes and our patch
> bandwidth is getting waaay way ahead of our review bandwidth.  So I
> think I'll loadshed this patchset at this time, sorry.

I expected the conflict with Mel's change in recent mmotm but doesn't want
to send patches against recent mmotm because it has several problems in
compaction so my test was really trobule.
I just picked patches from Hugh and Vlastimil and finally can test on it.
Anyway, I will rebase my patches on recent mmotm, hoping you picked every
patches on compaction part and respin after a few days.

Thanks for let me knowing your plan.

^ permalink raw reply

* Re: [PATCH] vhost_net: stop polling socket during rx processing
From: Jason Wang @ 2016-04-28  6:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20160427141317-mutt-send-email-mst@redhat.com>



On 04/27/2016 07:28 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 26, 2016 at 03:35:53AM -0400, Jason Wang wrote:
>> We don't stop polling socket during rx processing, this will lead
>> unnecessary wakeups from under layer net devices (E.g
>> sock_def_readable() form tun). Rx will be slowed down in this
>> way. This patch avoids this by stop polling socket during rx
>> processing. A small drawback is that this introduces some overheads in
>> light load case because of the extra start/stop polling, but single
>> netperf TCP_RR does not notice any change. In a super heavy load case,
>> e.g using pktgen to inject packet to guest, we get about ~17%
>> improvement on pps:
>>
>> before: ~1370000 pkt/s
>> after:  ~1500000 pkt/s
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> There is one other possible enhancement: we actually have the wait queue
> lock taken in _wake_up, but we give it up only to take it again in the
> handler.
>
> It would be nicer to just remove the entry when we wake
> the vhost thread. Re-add it if required.
> I think that something like the below would give you the necessary API.
> Pls feel free to use it if you are going to implement a patch on top
> doing this - that's not a reason not to include this simple patch
> though.

Thanks, this looks useful, will give it a try.

>
> --->
>
> wait: add API to drop a wait_queue_t entry from wake up handler
>
> A wake up handler might want to remove its own wait queue entry to avoid
> future wakeups.  In particular, vhost has such a need.  As wait queue
> lock is already taken, all we need is an API to remove the entry without
> wait_queue_head_t which isn't currently accessible to wake up handlers.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> ---
>
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27d7a0a..9c6604b 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -191,11 +191,17 @@ __add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
>  }
>  
>  static inline void
> -__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
> +__remove_wait_queue_entry(wait_queue_t *old)
>  {
>  	list_del(&old->task_list);
>  }
>  
> +static inline void
> +__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
> +{
> +	__remove_wait_queue_entry(old);
> +}
> +
>  typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
>  void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
>  void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox