linux-mm.kvack.org archive mirror
* [RFC 0/7] Support high-order page bulk allocation
@ 2020-08-14 17:31 Minchan Kim
  2020-08-14 17:31 ` [RFC 1/7] mm: page_owner: split page by order Minchan Kim
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Minchan Kim @ 2020-08-14 17:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Joonsoo Kim, Vlastimil Babka, John Dias,
	Suren Baghdasaryan, pullip.cho, Minchan Kim

Some special HW requires bulk allocation of high-order pages;
for example, 4800 order-4 pages.

One option to meet the requirement is to use a CMA area, because
the page allocator with compaction easily fails to satisfy the
request under memory pressure and is too slow for 4800 iterations.
However, CMA also has the following drawback:

 * calling cma_alloc 4800 times for order-4 chunks is too slow

To avoid the slowness, we could allocate 300MB of contiguous
memory in one go and then split it into order-4 chunks.
The problem with this approach is that the CMA allocation fails
if even one page in the range cannot be migrated out, which
happens easily with fs writes under memory pressure.
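
A minimal sketch of that one-shot approach, assuming a device-owned
struct cma handle; the helper name, handle and chunk math are
illustrative only and not part of this series:

  #include <linux/cma.h>
  #include <linux/mm.h>

  /*
   * Illustration of the approach above: one large cma_alloc() carved
   * into order-4 chunks.  The whole 300MB allocation fails if even one
   * page in the chosen range cannot be migrated out.
   */
  static struct page *alloc_chunks_from_cma(struct cma *cma,
                                            struct page **pages,
                                            unsigned int nr_elem,
                                            unsigned int order)
  {
          struct page *base;
          unsigned int i;

          /* e.g. 4800 << 4 pages == 300MB, allocated in one shot */
          base = cma_alloc(cma, (size_t)nr_elem << order, order, false);
          if (!base)
                  return NULL;

          /* hand out order-4 sized chunks of the contiguous range */
          for (i = 0; i < nr_elem; i++)
                  pages[i] = nth_page(base, i << order);

          /* caller must later cma_release(cma, base, nr_elem << order) */
          return base;
  }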

To solve these issues, this patchset introduces alloc_pages_bulk.

  int alloc_pages_bulk(unsigned long start, unsigned long end,
                       unsigned int migratetype, gfp_t gfp_mask,
                       unsigned int order, unsigned int nr_elem,
                       struct page **pages);

It scans the range [start, end) and migrates movable pages out of
it on a best-effort basis (implemented by the later patches in the
series) to create free pages of the requested order.

The allocated pages are returned via the pages parameter, and the
return value is the number of requested-order pages actually
allocated, which may be less than the nr_elem the caller asked for.

/**
 * alloc_pages_bulk() -- tries to allocate high-order pages
 * in batch from the given range [start, end)
 * @start:      start PFN to allocate
 * @end:        one-past-the-last PFN to allocate
 * @migratetype:        migratetype of the underlying pageblocks (either
 *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
 *                      in range must have the same migratetype and it must
 *                      be either of the two.
 * @gfp_mask:   GFP mask to use during compaction
 * @order:      page order requested
 * @nr_elem:    the number of high-order pages to allocate
 * @pages:      page array pointer to store allocated pages (must
 *              have space for at least nr_elem elements)
 *
 * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
 * aligned.  The PFN range must belong to a single zone.
 *
 * Return: the number of pages allocated on success or negative error code.
 * The allocated pages should be freed using __free_pages
 */
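
For orientation, here is a sketch of how a driver might call the
proposed API, based only on the prototype and kernel-doc above; the
PFN range, the order-4 request and the MIGRATE_CMA migratetype are
made-up example values:

  #include <linux/gfp.h>
  #include <linux/mm.h>

  /*
   * Illustrative caller: try to take nr_elem order-4 pages out of a
   * driver-reserved PFN range and give them back when done.
   * start_pfn/end_pfn stand in for a real device-owned range.
   */
  static int grab_order4_pages(unsigned long start_pfn, unsigned long end_pfn,
                               struct page **pages, unsigned int nr_elem)
  {
          int nr, i;

          nr = alloc_pages_bulk(start_pfn, end_pfn, MIGRATE_CMA, GFP_KERNEL,
                                4, nr_elem, pages);
          if (nr < 0)
                  return nr;      /* nothing could be allocated */

          /* ... hand the nr order-4 pages to the HW here ... */

          for (i = 0; i < nr; i++)
                  __free_pages(pages[i], 4);      /* as documented above */

          return nr;
  }

Note that nr may be smaller than nr_elem, so callers have to cope
with a partial result.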

The test performs 4800 order-4 allocations (i.e., 300MB in total)
under a kernel build workload. System RAM is 1.5GB and the CMA area
is 500MB.

Using CMA to allocate the 300MB, 10 out of 10 trials failed, with
large latency (up to several seconds).

With this alloc_pages_bulk API, 7 out of 10 trials succeeded in
allocating all 4800 pages; the remaining 3 allocated 4799, 4789 and
4799 pages. All trials completed within 300ms.

This patchset is based on next-20200813.

Minchan Kim (7):
  mm: page_owner: split page by order
  mm: introduce split_page_by_order
  mm: compaction: deal with upcoming high-order page splitting
  mm: factor __alloc_contig_range out
  mm: introduce alloc_pages_bulk API
  mm: make alloc_pages_bulk best effort
  mm/page_isolation: avoid drain_all_pages for alloc_pages_bulk

 include/linux/gfp.h            |   5 +
 include/linux/mm.h             |   2 +
 include/linux/page-isolation.h |   1 +
 include/linux/page_owner.h     |  10 +-
 mm/compaction.c                |  64 +++++++----
 mm/huge_memory.c               |   2 +-
 mm/internal.h                  |   5 +-
 mm/page_alloc.c                | 198 ++++++++++++++++++++++++++-------
 mm/page_isolation.c            |  10 +-
 mm/page_owner.c                |   7 +-
 10 files changed, 230 insertions(+), 74 deletions(-)

-- 
2.28.0.220.ged08abb693-goog




Thread overview: 27+ messages
2020-08-14 17:31 [RFC 0/7] Support high-order page bulk allocation Minchan Kim
2020-08-14 17:31 ` [RFC 1/7] mm: page_owner: split page by order Minchan Kim
2020-08-14 17:31 ` [RFC 2/7] mm: introduce split_page_by_order Minchan Kim
2020-08-14 17:31 ` [RFC 3/7] mm: compaction: deal with upcoming high-order page splitting Minchan Kim
2020-08-14 17:31 ` [RFC 4/7] mm: factor __alloc_contig_range out Minchan Kim
2020-08-14 17:31 ` [RFC 5/7] mm: introduce alloc_pages_bulk API Minchan Kim
2020-08-17 17:40   ` David Hildenbrand
2020-08-14 17:31 ` [RFC 6/7] mm: make alloc_pages_bulk best effort Minchan Kim
2020-08-14 17:31 ` [RFC 7/7] mm/page_isolation: avoid drain_all_pages for alloc_pages_bulk Minchan Kim
2020-08-14 17:40 ` [RFC 0/7] Support high-order page bulk allocation Matthew Wilcox
2020-08-14 20:55   ` Minchan Kim
2020-08-18  2:16     ` Cho KyongHo
2020-08-18  9:22     ` Cho KyongHo
2020-08-16 12:31 ` David Hildenbrand
2020-08-17 15:27   ` Minchan Kim
2020-08-17 15:45     ` David Hildenbrand
2020-08-17 16:30       ` Minchan Kim
2020-08-17 16:44         ` David Hildenbrand
2020-08-17 17:03           ` David Hildenbrand
2020-08-17 23:34           ` Minchan Kim
2020-08-18  7:42             ` Nicholas Piggin
2020-08-18  7:49             ` David Hildenbrand
2020-08-18 15:15               ` Minchan Kim
2020-08-18 15:58                 ` Matthew Wilcox
2020-08-18 16:22                   ` David Hildenbrand
2020-08-18 16:49                     ` Minchan Kim
2020-08-19  0:27                     ` Yang Shi
