* [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios
@ 2024-08-14 3:54 Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Yu Zhao @ 2024-08-14 3:54 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel, Yu Zhao
Using __GFP_COMP for gigantic folios can greatly reduce not only the
amount of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Perf profile before:
Alloc
- 99.99% alloc_pool_huge_folio
- __alloc_fresh_hugetlb_folio
- 83.23% alloc_contig_pages_noprof
- 47.46% alloc_contig_range_noprof
- 20.96% isolate_freepages_range
16.10% split_page
- 14.10% start_isolate_page_range
- 12.02% undo_isolate_page_range
Free
- update_and_free_pages_bulk
- 87.71% free_contig_range
- 76.02% free_unref_page
- 41.30% free_unref_page_commit
- 32.58% free_pcppages_bulk
- 24.75% __free_one_page
13.96% _raw_spin_trylock
12.27% __update_and_free_hugetlb_folio
Perf profile after:
Alloc
- 99.99% alloc_pool_huge_folio
alloc_gigantic_folio
- alloc_contig_pages_noprof
- 59.15% alloc_contig_range_noprof
- 20.72% start_isolate_page_range
20.64% prep_new_page
- 17.13% undo_isolate_page_range
Free
- update_and_free_pages_bulk
- __folio_put
- __free_pages_ok
7.46% free_tail_page_prepare
- 1.97% free_one_page
1.86% __free_one_page
Yu Zhao (3):
mm/contig_alloc: support __GFP_COMP
mm/cma: add cma_{alloc,free}_folio()
mm/hugetlb: use __GFP_COMP for gigantic folios
include/linux/cma.h | 16 +++
include/linux/gfp.h | 23 ++++
include/linux/hugetlb.h | 9 +-
mm/cma.c | 55 ++++++--
mm/compaction.c | 41 +-----
mm/hugetlb.c | 293 ++++++++--------------------------------
mm/page_alloc.c | 111 ++++++++++-----
7 files changed, 226 insertions(+), 322 deletions(-)
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-14 3:54 [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios Yu Zhao
@ 2024-08-14 3:54 ` Yu Zhao
2024-08-22 8:21 ` Yu Liao
` (2 more replies)
2024-08-14 3:54 ` [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio() Yu Zhao
` (2 subsequent siblings)
3 siblings, 3 replies; 26+ messages in thread
From: Yu Zhao @ 2024-08-14 3:54 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel, Yu Zhao
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
success the function returns a large folio prepared by
prep_new_page(), rather than a range of order-0 pages prepared by
split_free_pages() (which is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
free_one_page() needs to handle that by split_large_buddy().
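For illustration, a minimal (hypothetical) caller sketch using the
folio_alloc_gigantic() helper added below; the order, node and nodemask
are made up for the example, and error handling beyond the NULL check
is omitted:
	/* Allocate a 1GB folio (order 18 with 4KB base pages) near @nid. */
	static struct folio *example_alloc_1gb(int nid)
	{
		gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN;

		return folio_alloc_gigantic(18, gfp, nid, &node_states[N_MEMORY]);
	}

	static void example_free_1gb(struct folio *folio)
	{
		/* Paired with folio_put(), not free_contig_range(). */
		if (folio)
			folio_put(folio);
	}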
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
include/linux/gfp.h | 23 +++++++++
mm/compaction.c | 41 ++--------------
mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++-------------
3 files changed, 108 insertions(+), 67 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f53f76e0b17e..59266df56aeb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -446,4 +446,27 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
+#ifdef CONFIG_CONTIG_ALLOC
+static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
+ int nid, nodemask_t *node)
+{
+ struct page *page;
+
+ if (WARN_ON(!order || !(gfp | __GFP_COMP)))
+ return NULL;
+
+ page = alloc_contig_pages_noprof(1 << order, gfp, nid, node);
+
+ return page ? page_folio(page) : NULL;
+}
+#else
+static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
+ int nid, nodemask_t *node)
+{
+ return NULL;
+}
+#endif
+/* This should be paired with folio_put() rather than free_contig_range(). */
+#define folio_alloc_gigantic(...) alloc_hooks(folio_alloc_gigantic_noprof(__VA_ARGS__))
+
#endif /* __LINUX_GFP_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index eb95e9b435d0..d1041fbce679 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -86,33 +86,6 @@ static struct page *mark_allocated_noprof(struct page *page, unsigned int order,
}
#define mark_allocated(...) alloc_hooks(mark_allocated_noprof(__VA_ARGS__))
-static void split_map_pages(struct list_head *freepages)
-{
- unsigned int i, order;
- struct page *page, *next;
- LIST_HEAD(tmp_list);
-
- for (order = 0; order < NR_PAGE_ORDERS; order++) {
- list_for_each_entry_safe(page, next, &freepages[order], lru) {
- unsigned int nr_pages;
-
- list_del(&page->lru);
-
- nr_pages = 1 << order;
-
- mark_allocated(page, order, __GFP_MOVABLE);
- if (order)
- split_page(page, order);
-
- for (i = 0; i < nr_pages; i++) {
- list_add(&page->lru, &tmp_list);
- page++;
- }
- }
- list_splice_init(&tmp_list, &freepages[0]);
- }
-}
-
static unsigned long release_free_list(struct list_head *freepages)
{
int order;
@@ -742,11 +715,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
*
* Non-free pages, invalid PFNs, or zone boundaries within the
* [start_pfn, end_pfn) range are considered errors, cause function to
- * undo its actions and return zero.
+ * undo its actions and return zero. cc->freepages[] are empty.
*
* Otherwise, function returns one-past-the-last PFN of isolated page
* (which may be greater then end_pfn if end fell in a middle of
- * a free page).
+ * a free page). cc->freepages[] contain free pages isolated.
*/
unsigned long
isolate_freepages_range(struct compact_control *cc,
@@ -754,10 +727,9 @@ isolate_freepages_range(struct compact_control *cc,
{
unsigned long isolated, pfn, block_start_pfn, block_end_pfn;
int order;
- struct list_head tmp_freepages[NR_PAGE_ORDERS];
for (order = 0; order < NR_PAGE_ORDERS; order++)
- INIT_LIST_HEAD(&tmp_freepages[order]);
+ INIT_LIST_HEAD(&cc->freepages[order]);
pfn = start_pfn;
block_start_pfn = pageblock_start_pfn(pfn);
@@ -788,7 +760,7 @@ isolate_freepages_range(struct compact_control *cc,
break;
isolated = isolate_freepages_block(cc, &isolate_start_pfn,
- block_end_pfn, tmp_freepages, 0, true);
+ block_end_pfn, cc->freepages, 0, true);
/*
* In strict mode, isolate_freepages_block() returns 0 if
@@ -807,13 +779,10 @@ isolate_freepages_range(struct compact_control *cc,
if (pfn < end_pfn) {
/* Loop terminated early, cleanup. */
- release_free_list(tmp_freepages);
+ release_free_list(cc->freepages);
return 0;
}
- /* __isolate_free_page() does not map the pages */
- split_map_pages(tmp_freepages);
-
/* We don't use freelists for anything. */
return pfn;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5841bbea482a..0a43e4ea29e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1197,16 +1197,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
spin_unlock_irqrestore(&zone->lock, flags);
}
+/* Split a multi-block free page into its individual pageblocks. */
+static void split_large_buddy(struct zone *zone, struct page *page,
+ unsigned long pfn, int order, fpi_t fpi)
+{
+ unsigned long end = pfn + (1 << order);
+
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
+ /* Caller removed page from freelist, buddy info cleared! */
+ VM_WARN_ON_ONCE(PageBuddy(page));
+
+ if (order > pageblock_order)
+ order = pageblock_order;
+
+ while (pfn != end) {
+ int mt = get_pfnblock_migratetype(page, pfn);
+
+ __free_one_page(page, pfn, zone, order, mt, fpi);
+ pfn += 1 << order;
+ page = pfn_to_page(pfn);
+ }
+}
+
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
fpi_t fpi_flags)
{
unsigned long flags;
- int migratetype;
spin_lock_irqsave(&zone->lock, flags);
- migratetype = get_pfnblock_migratetype(page, pfn);
- __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+ split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -1698,27 +1718,6 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
return start_pfn;
}
-/* Split a multi-block free page into its individual pageblocks */
-static void split_large_buddy(struct zone *zone, struct page *page,
- unsigned long pfn, int order)
-{
- unsigned long end_pfn = pfn + (1 << order);
-
- VM_WARN_ON_ONCE(order <= pageblock_order);
- VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
-
- /* Caller removed page from freelist, buddy info cleared! */
- VM_WARN_ON_ONCE(PageBuddy(page));
-
- while (pfn != end_pfn) {
- int mt = get_pfnblock_migratetype(page, pfn);
-
- __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE);
- pfn += pageblock_nr_pages;
- page = pfn_to_page(pfn);
- }
-}
-
/**
* move_freepages_block_isolate - move free pages in block for page isolation
* @zone: the zone
@@ -1759,7 +1758,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, pfn));
set_pageblock_migratetype(page, migratetype);
- split_large_buddy(zone, buddy, pfn, order);
+ split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
return true;
}
@@ -1770,7 +1769,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, order,
get_pfnblock_migratetype(page, pfn));
set_pageblock_migratetype(page, migratetype);
- split_large_buddy(zone, page, pfn, order);
+ split_large_buddy(zone, page, pfn, order, FPI_NONE);
return true;
}
move:
@@ -6440,6 +6439,31 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
return (ret < 0) ? ret : 0;
}
+static void split_free_pages(struct list_head *list)
+{
+ int order;
+
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ struct page *page, *next;
+ int nr_pages = 1 << order;
+
+ list_for_each_entry_safe(page, next, &list[order], lru) {
+ int i;
+
+ post_alloc_hook(page, order, __GFP_MOVABLE);
+ if (!order)
+ continue;
+
+ split_page(page, order);
+
+ /* Add all subpages to the order-0 head, in sequence. */
+ list_del(&page->lru);
+ for (i = 0; i < nr_pages; i++)
+ list_add_tail(&page[i].lru, &list[0]);
+ }
+ }
+}
+
/**
* alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
@@ -6552,12 +6576,25 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
goto done;
}
- /* Free head and tail (if any) */
- if (start != outer_start)
- free_contig_range(outer_start, start - outer_start);
- if (end != outer_end)
- free_contig_range(end, outer_end - end);
+ if (!(gfp_mask & __GFP_COMP)) {
+ split_free_pages(cc.freepages);
+ /* Free head and tail (if any) */
+ if (start != outer_start)
+ free_contig_range(outer_start, start - outer_start);
+ if (end != outer_end)
+ free_contig_range(end, outer_end - end);
+ } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
+ struct page *head = pfn_to_page(start);
+ int order = ilog2(end - start);
+
+ check_new_pages(head, order);
+ prep_new_page(head, order, gfp_mask, 0);
+ } else {
+ ret = -EINVAL;
+ WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
+ start, end, outer_start, outer_end);
+ }
done:
undo_isolate_page_range(start, end, migratetype);
return ret;
@@ -6666,6 +6703,18 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
unsigned long count = 0;
+ struct folio *folio = pfn_folio(pfn);
+
+ if (folio_test_large(folio)) {
+ int expected = folio_nr_pages(folio);
+
+ if (nr_pages == expected)
+ folio_put(folio);
+ else
+ WARN(true, "PFN %lu: nr_pages %lu != expected %d\n",
+ pfn, nr_pages, expected);
+ return;
+ }
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio()
2024-08-14 3:54 [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
@ 2024-08-14 3:54 ` Yu Zhao
2024-08-22 17:24 ` Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios Yu Zhao
2024-08-30 17:55 ` [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free " jane.chu
3 siblings, 1 reply; 26+ messages in thread
From: Yu Zhao @ 2024-08-14 3:54 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel, Yu Zhao
With alloc_contig_range() and free_contig_range() supporting large
folios, CMA can allocate and free large folios too, by
cma_alloc_folio() and cma_free_folio().
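For illustration, a minimal (hypothetical) caller sketch; "my_cma"
stands in for whatever CMA area the caller owns, and the order is made
up for the example:
	/* Allocate an order-9 (2MB with 4KB base pages) folio from a CMA area. */
	static struct folio *example_cma_alloc(struct cma *my_cma)
	{
		return cma_alloc_folio(my_cma, 9, GFP_KERNEL | __GFP_COMP | __GFP_NOWARN);
	}

	static void example_cma_free(struct cma *my_cma, struct folio *folio)
	{
		/* cma_free_folio() returns false if the folio isn't from this area. */
		if (folio)
			WARN_ON(!cma_free_folio(my_cma, folio));
	}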
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
include/linux/cma.h | 16 +++++++++++++
mm/cma.c | 55 ++++++++++++++++++++++++++++++++-------------
2 files changed, 56 insertions(+), 15 deletions(-)
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 9db877506ea8..d15b64f51336 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -52,4 +52,20 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
extern void cma_reserve_pages_on_error(struct cma *cma);
+
+#ifdef CONFIG_CMA
+struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp);
+bool cma_free_folio(struct cma *cma, const struct folio *folio);
+#else
+static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
+{
+ return NULL;
+}
+
+static inline bool cma_free_folio(struct cma *cma, const struct folio *folio)
+{
+ return false;
+}
+#endif
+
#endif
diff --git a/mm/cma.c b/mm/cma.c
index 95d6950e177b..4354823d28cf 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma)
spin_unlock_irq(&cma->lock);
}
-/**
- * cma_alloc() - allocate pages from contiguous area
- * @cma: Contiguous memory region for which the allocation is performed.
- * @count: Requested number of pages.
- * @align: Requested alignment of pages (in PAGE_SIZE order).
- * @no_warn: Avoid printing message about failed allocation
- *
- * This function allocates part of contiguous memory on specific
- * contiguous memory area.
- */
-struct page *cma_alloc(struct cma *cma, unsigned long count,
- unsigned int align, bool no_warn)
+static struct page *__cma_alloc(struct cma *cma, unsigned long count,
+ unsigned int align, gfp_t gfp)
{
unsigned long mask, offset;
unsigned long pfn = -1;
@@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
- ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
- GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+ ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
@@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
page_kasan_tag_reset(nth_page(page, i));
}
- if (ret && !no_warn) {
+ if (ret && !(gfp & __GFP_NOWARN)) {
pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n",
__func__, cma->name, count, ret);
cma_debug_show_areas(cma);
@@ -513,6 +502,34 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
return page;
}
+/**
+ * cma_alloc() - allocate pages from contiguous area
+ * @cma: Contiguous memory region for which the allocation is performed.
+ * @count: Requested number of pages.
+ * @align: Requested alignment of pages (in PAGE_SIZE order).
+ * @no_warn: Avoid printing message about failed allocation
+ *
+ * This function allocates part of contiguous memory on specific
+ * contiguous memory area.
+ */
+struct page *cma_alloc(struct cma *cma, unsigned long count,
+ unsigned int align, bool no_warn)
+{
+ return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+}
+
+struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
+{
+ struct page *page;
+
+ if (WARN_ON(!order || !(gfp | __GFP_COMP)))
+ return NULL;
+
+ page = __cma_alloc(cma, 1 << order, order, gfp);
+
+ return page ? page_folio(page) : NULL;
+}
+
bool cma_pages_valid(struct cma *cma, const struct page *pages,
unsigned long count)
{
@@ -564,6 +581,14 @@ bool cma_release(struct cma *cma, const struct page *pages,
return true;
}
+bool cma_free_folio(struct cma *cma, const struct folio *folio)
+{
+ if (WARN_ON(!folio_test_large(folio)))
+ return false;
+
+ return cma_release(cma, &folio->page, folio_nr_pages(folio));
+}
+
int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
{
int i;
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios
2024-08-14 3:54 [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio() Yu Zhao
@ 2024-08-14 3:54 ` Yu Zhao
2024-08-16 15:23 ` Zi Yan
2024-08-30 17:55 ` [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free " jane.chu
3 siblings, 1 reply; 26+ messages in thread
From: Yu Zhao @ 2024-08-14 3:54 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel, Yu Zhao,
Frank van der Linden
Use __GFP_COMP for gigantic folios to greatly reduce not only the
amount of code but also the allocation and free time.
LOC (approximately): +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Frank van der Linden <fvdl@google.com>
---
include/linux/hugetlb.h | 9 +-
mm/hugetlb.c | 293 ++++++++--------------------------------
2 files changed, 62 insertions(+), 240 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3100a52ceb73..98c47c394b89 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -896,10 +896,11 @@ static inline bool hugepage_movable_supported(struct hstate *h)
/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
- if (hugepage_movable_supported(h))
- return GFP_HIGHUSER_MOVABLE;
- else
- return GFP_HIGHUSER;
+ gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
+
+ gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
+
+ return gfp;
}
static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 71d469c8e711..efa77ce87dcc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -56,16 +56,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
#ifdef CONFIG_CMA
static struct cma *hugetlb_cma[MAX_NUMNODES];
static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
-static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
-{
- return cma_pages_valid(hugetlb_cma[folio_nid(folio)], &folio->page,
- 1 << order);
-}
-#else
-static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
-{
- return false;
-}
#endif
static unsigned long hugetlb_cma_size __initdata;
@@ -100,6 +90,17 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
+static void hugetlb_free_folio(struct folio *folio)
+{
+#ifdef CONFIG_CMA
+ int nid = folio_nid(folio);
+
+ if (cma_free_folio(hugetlb_cma[nid], folio))
+ return;
+#endif
+ folio_put(folio);
+}
+
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
if (spool->count)
@@ -1512,95 +1513,54 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
-/* used to demote non-gigantic_huge pages as well */
-static void __destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order, bool demote)
-{
- int i;
- int nr_pages = 1 << order;
- struct page *p;
-
- atomic_set(&folio->_entire_mapcount, 0);
- atomic_set(&folio->_large_mapcount, 0);
- atomic_set(&folio->_pincount, 0);
-
- for (i = 1; i < nr_pages; i++) {
- p = folio_page(folio, i);
- p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
- p->mapping = NULL;
- clear_compound_head(p);
- if (!demote)
- set_page_refcounted(p);
- }
-
- __folio_clear_head(folio);
-}
-
-static void destroy_compound_hugetlb_folio_for_demote(struct folio *folio,
- unsigned int order)
-{
- __destroy_compound_gigantic_folio(folio, order, true);
-}
-
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order)
-{
- __destroy_compound_gigantic_folio(folio, order, false);
-}
-
-static void free_gigantic_folio(struct folio *folio, unsigned int order)
-{
- /*
- * If the page isn't allocated using the cma allocator,
- * cma_release() returns false.
- */
-#ifdef CONFIG_CMA
- int nid = folio_nid(folio);
-
- if (cma_release(hugetlb_cma[nid], &folio->page, 1 << order))
- return;
-#endif
-
- free_contig_range(folio_pfn(folio), 1 << order);
-}
-
#ifdef CONFIG_CONTIG_ALLOC
static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
{
- struct page *page;
- unsigned long nr_pages = pages_per_huge_page(h);
+ struct folio *folio;
+ int order = huge_page_order(h);
+ bool retried = false;
+
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
-
+retry:
+ folio = NULL;
#ifdef CONFIG_CMA
{
int node;
- if (hugetlb_cma[nid]) {
- page = cma_alloc(hugetlb_cma[nid], nr_pages,
- huge_page_order(h), true);
- if (page)
- return page_folio(page);
- }
+ if (hugetlb_cma[nid])
+ folio = cma_alloc_folio(hugetlb_cma[nid], order, gfp_mask);
- if (!(gfp_mask & __GFP_THISNODE)) {
+ if (!folio && !(gfp_mask & __GFP_THISNODE)) {
for_each_node_mask(node, *nodemask) {
if (node == nid || !hugetlb_cma[node])
continue;
- page = cma_alloc(hugetlb_cma[node], nr_pages,
- huge_page_order(h), true);
- if (page)
- return page_folio(page);
+ folio = cma_alloc_folio(hugetlb_cma[node], order, gfp_mask);
+ if (folio)
+ break;
}
}
}
#endif
+ if (!folio) {
+ folio = folio_alloc_gigantic(order, gfp_mask, nid, nodemask);
+ if (!folio)
+ return NULL;
+ }
- page = alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
- return page ? page_folio(page) : NULL;
+ if (folio_ref_freeze(folio, 1))
+ return folio;
+
+ pr_warn("HugeTLB: unexpected refcount on PFN %lu\n", folio_pfn(folio));
+ hugetlb_free_folio(folio);
+ if (!retried) {
+ retried = true;
+ goto retry;
+ }
+ return NULL;
}
#else /* !CONFIG_CONTIG_ALLOC */
@@ -1617,10 +1577,6 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
{
return NULL;
}
-static inline void free_gigantic_folio(struct folio *folio,
- unsigned int order) { }
-static inline void destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order) { }
#endif
/*
@@ -1747,20 +1703,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
folio_clear_hugetlb_hwpoison(folio);
folio_ref_unfreeze(folio, 1);
-
- /*
- * Non-gigantic pages demoted from CMA allocated gigantic pages
- * need to be given back to CMA in free_gigantic_folio.
- */
- if (hstate_is_gigantic(h) ||
- hugetlb_cma_folio(folio, huge_page_order(h))) {
- destroy_compound_gigantic_folio(folio, huge_page_order(h));
- free_gigantic_folio(folio, huge_page_order(h));
- } else {
- INIT_LIST_HEAD(&folio->_deferred_list);
- folio_clear_partially_mapped(folio);
- folio_put(folio);
- }
+ INIT_LIST_HEAD(&folio->_deferred_list);
+ folio_clear_partially_mapped(folio);
+ hugetlb_free_folio(folio);
}
/*
@@ -2033,95 +1978,6 @@ static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni
spin_unlock_irq(&hugetlb_lock);
}
-static bool __prep_compound_gigantic_folio(struct folio *folio,
- unsigned int order, bool demote)
-{
- int i, j;
- int nr_pages = 1 << order;
- struct page *p;
-
- __folio_clear_reserved(folio);
- for (i = 0; i < nr_pages; i++) {
- p = folio_page(folio, i);
-
- /*
- * For gigantic hugepages allocated through bootmem at
- * boot, it's safer to be consistent with the not-gigantic
- * hugepages and clear the PG_reserved bit from all tail pages
- * too. Otherwise drivers using get_user_pages() to access tail
- * pages may get the reference counting wrong if they see
- * PG_reserved set on a tail page (despite the head page not
- * having PG_reserved set). Enforcing this consistency between
- * head and tail pages allows drivers to optimize away a check
- * on the head page when they need know if put_page() is needed
- * after get_user_pages().
- */
- if (i != 0) /* head page cleared above */
- __ClearPageReserved(p);
- /*
- * Subtle and very unlikely
- *
- * Gigantic 'page allocators' such as memblock or cma will
- * return a set of pages with each page ref counted. We need
- * to turn this set of pages into a compound page with tail
- * page ref counts set to zero. Code such as speculative page
- * cache adding could take a ref on a 'to be' tail page.
- * We need to respect any increased ref count, and only set
- * the ref count to zero if count is currently 1. If count
- * is not 1, we return an error. An error return indicates
- * the set of pages can not be converted to a gigantic page.
- * The caller who allocated the pages should then discard the
- * pages using the appropriate free interface.
- *
- * In the case of demote, the ref count will be zero.
- */
- if (!demote) {
- if (!page_ref_freeze(p, 1)) {
- pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
- goto out_error;
- }
- } else {
- VM_BUG_ON_PAGE(page_count(p), p);
- }
- if (i != 0)
- set_compound_head(p, &folio->page);
- }
- __folio_set_head(folio);
- /* we rely on prep_new_hugetlb_folio to set the hugetlb flag */
- folio_set_order(folio, order);
- atomic_set(&folio->_entire_mapcount, -1);
- atomic_set(&folio->_large_mapcount, -1);
- atomic_set(&folio->_pincount, 0);
- return true;
-
-out_error:
- /* undo page modifications made above */
- for (j = 0; j < i; j++) {
- p = folio_page(folio, j);
- if (j != 0)
- clear_compound_head(p);
- set_page_refcounted(p);
- }
- /* need to clear PG_reserved on remaining tail pages */
- for (; j < nr_pages; j++) {
- p = folio_page(folio, j);
- __ClearPageReserved(p);
- }
- return false;
-}
-
-static bool prep_compound_gigantic_folio(struct folio *folio,
- unsigned int order)
-{
- return __prep_compound_gigantic_folio(folio, order, false);
-}
-
-static bool prep_compound_gigantic_folio_for_demote(struct folio *folio,
- unsigned int order)
-{
- return __prep_compound_gigantic_folio(folio, order, true);
-}
-
/*
* Find and lock address space (mapping) in write mode.
*
@@ -2160,7 +2016,6 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
*/
if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
alloc_try_hard = false;
- gfp_mask |= __GFP_COMP|__GFP_NOWARN;
if (alloc_try_hard)
gfp_mask |= __GFP_RETRY_MAYFAIL;
if (nid == NUMA_NO_NODE)
@@ -2207,48 +2062,16 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
return folio;
}
-static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
-{
- struct folio *folio;
- bool retry = false;
-
-retry:
- if (hstate_is_gigantic(h))
- folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
- else
- folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
- nid, nmask, node_alloc_noretry);
- if (!folio)
- return NULL;
-
- if (hstate_is_gigantic(h)) {
- if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
- /*
- * Rare failure to convert pages to compound page.
- * Free pages and try again - ONCE!
- */
- free_gigantic_folio(folio, huge_page_order(h));
- if (!retry) {
- retry = true;
- goto retry;
- }
- return NULL;
- }
- }
-
- return folio;
-}
-
static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask,
nodemask_t *node_alloc_noretry)
{
struct folio *folio;
- folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
- node_alloc_noretry);
+ if (hstate_is_gigantic(h))
+ folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
+ else
+ folio = alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, node_alloc_noretry);
if (folio)
init_new_hugetlb_folio(h, folio);
return folio;
@@ -2266,7 +2089,10 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
{
struct folio *folio;
- folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+ if (hstate_is_gigantic(h))
+ folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
+ else
+ folio = alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
if (!folio)
return NULL;
@@ -2550,9 +2376,8 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
if (mpol_is_preferred_many(mpol)) {
- gfp_t gfp = gfp_mask | __GFP_NOWARN;
+ gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
/* Fallback to all nodes if page==NULL */
@@ -3334,6 +3159,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
+ __ClearPageReserved(folio_page(folio, pfn - head_pfn));
__init_single_page(page, pfn, zone, nid);
prep_compound_tail((struct page *)folio, pfn - head_pfn);
ret = page_ref_freeze(page, 1);
@@ -3950,21 +3776,16 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst,
continue;
list_del(&folio->lru);
- /*
- * Use destroy_compound_hugetlb_folio_for_demote for all huge page
- * sizes as it will not ref count folios.
- */
- destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(src));
+
+ split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst));
+ pgalloc_tag_split(&folio->page, 1 << huge_page_order(src));
for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) {
struct page *page = folio_page(folio, i);
- if (hstate_is_gigantic(dst))
- prep_compound_gigantic_folio_for_demote(page_folio(page),
- dst->order);
- else
- prep_compound_page(page, dst->order);
- set_page_private(page, 0);
+ page->mapping = NULL;
+ clear_compound_head(page);
+ prep_compound_page(page, dst->order);
init_new_hugetlb_folio(dst, page_folio(page));
list_add(&page->lru, &dst_list);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios
2024-08-14 3:54 ` [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios Yu Zhao
@ 2024-08-16 15:23 ` Zi Yan
2024-08-16 16:02 ` Yu Zhao
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2024-08-16 15:23 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Muchun Song, Matthew Wilcox (Oracle), linux-mm,
linux-kernel, Frank van der Linden
On 13 Aug 2024, at 23:54, Yu Zhao wrote:
> Use __GFP_COMP for gigantic folios to greatly reduce not only the
> amount of code but also the allocation and free time.
>
> LOC (approximately): +60, -240
>
> Allocate and free 500 1GB hugeTLB memory without HVO by:
> time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> Before After
> Alloc ~13s ~10s
> Free ~15s <1s
>
> The above magnitude generally holds for multiple x86 and arm64 CPU
> models.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Reported-by: Frank van der Linden <fvdl@google.com>
> ---
> include/linux/hugetlb.h | 9 +-
> mm/hugetlb.c | 293 ++++++++--------------------------------
> 2 files changed, 62 insertions(+), 240 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 3100a52ceb73..98c47c394b89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -896,10 +896,11 @@ static inline bool hugepage_movable_supported(struct hstate *h)
> /* Movability of hugepages depends on migration support. */
> static inline gfp_t htlb_alloc_mask(struct hstate *h)
> {
> - if (hugepage_movable_supported(h))
> - return GFP_HIGHUSER_MOVABLE;
> - else
> - return GFP_HIGHUSER;
> + gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> +
> + gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> +
> + return gfp;
> }
>
> static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 71d469c8e711..efa77ce87dcc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -56,16 +56,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
> #ifdef CONFIG_CMA
> static struct cma *hugetlb_cma[MAX_NUMNODES];
> static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
> -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> -{
> - return cma_pages_valid(hugetlb_cma[folio_nid(folio)], &folio->page,
> - 1 << order);
> -}
> -#else
> -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> -{
> - return false;
> -}
> #endif
> static unsigned long hugetlb_cma_size __initdata;
>
> @@ -100,6 +90,17 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
> static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
>
> +static void hugetlb_free_folio(struct folio *folio)
> +{
> +#ifdef CONFIG_CMA
> + int nid = folio_nid(folio);
> +
> + if (cma_free_folio(hugetlb_cma[nid], folio))
> + return;
> +#endif
> + folio_put(folio);
> +}
> +
It seems that we no longer use free_contig_range() to free gigantic
folios from alloc_contig_range(). Will it work? Or did I miss anything?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios
2024-08-16 15:23 ` Zi Yan
@ 2024-08-16 16:02 ` Yu Zhao
2024-08-16 16:30 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: Yu Zhao @ 2024-08-16 16:02 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Muchun Song, Matthew Wilcox (Oracle), linux-mm,
linux-kernel, Frank van der Linden
On Fri, Aug 16, 2024 at 9:23 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 13 Aug 2024, at 23:54, Yu Zhao wrote:
>
> > Use __GFP_COMP for gigantic folios to greatly reduce not only the
> > amount of code but also the allocation and free time.
> >
> > LOC (approximately): +60, -240
> >
> > Allocate and free 500 1GB hugeTLB memory without HVO by:
> > time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > Before After
> > Alloc ~13s ~10s
> > Free ~15s <1s
> >
> > The above magnitude generally holds for multiple x86 and arm64 CPU
> > models.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > Reported-by: Frank van der Linden <fvdl@google.com>
> > ---
> > include/linux/hugetlb.h | 9 +-
> > mm/hugetlb.c | 293 ++++++++--------------------------------
> > 2 files changed, 62 insertions(+), 240 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 3100a52ceb73..98c47c394b89 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -896,10 +896,11 @@ static inline bool hugepage_movable_supported(struct hstate *h)
> > /* Movability of hugepages depends on migration support. */
> > static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > {
> > - if (hugepage_movable_supported(h))
> > - return GFP_HIGHUSER_MOVABLE;
> > - else
> > - return GFP_HIGHUSER;
> > + gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> > +
> > + gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> > +
> > + return gfp;
> > }
> >
> > static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 71d469c8e711..efa77ce87dcc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -56,16 +56,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
> > #ifdef CONFIG_CMA
> > static struct cma *hugetlb_cma[MAX_NUMNODES];
> > static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
> > -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> > -{
> > - return cma_pages_valid(hugetlb_cma[folio_nid(folio)], &folio->page,
> > - 1 << order);
> > -}
> > -#else
> > -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> > -{
> > - return false;
> > -}
> > #endif
> > static unsigned long hugetlb_cma_size __initdata;
> >
> > @@ -100,6 +90,17 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> > unsigned long start, unsigned long end);
> > static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
> >
> > +static void hugetlb_free_folio(struct folio *folio)
> > +{
> > +#ifdef CONFIG_CMA
> > + int nid = folio_nid(folio);
> > +
> > + if (cma_free_folio(hugetlb_cma[nid], folio))
> > + return;
> > +#endif
> > + folio_put(folio);
> > +}
> > +
>
> It seems that we no longer use free_contig_range() to free gigantic
> folios from alloc_contig_range().
We switched to two pairs of external (to the allocator) APIs in this patch:
folio_alloc_gigantic()
folio_put()
and
cma_alloc_folio()
cma_free_folio()
> Will it work? Or did I miss anything?
alloc_contig_range() and free_contig_range() also work with __GFP_COMP
/ large folios, but this pair is internal (to the allocator) and
shouldn't be used directly except to implement external APIs like
above.
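For example (illustrative only, with made-up function and parameter
names), a hypothetical user outside the allocator would pair the
external APIs like this, never alloc_contig_range()/free_contig_range():
	static void example_pairing(struct cma *cma, int order, int nid,
				    nodemask_t *nodemask)
	{
		struct folio *folio;
		gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN;

		/* Pair 1: buddy-backed gigantic folio. */
		folio = folio_alloc_gigantic(order, gfp, nid, nodemask);
		if (folio)
			folio_put(folio);

		/* Pair 2: CMA-backed large folio. */
		folio = cma_alloc_folio(cma, order, gfp);
		if (folio)
			cma_free_folio(cma, folio);
	}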
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios
2024-08-16 16:02 ` Yu Zhao
@ 2024-08-16 16:30 ` Zi Yan
0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2024-08-16 16:30 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Muchun Song, Matthew Wilcox (Oracle), linux-mm,
linux-kernel, Frank van der Linden
On 16 Aug 2024, at 12:02, Yu Zhao wrote:
> On Fri, Aug 16, 2024 at 9:23 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 13 Aug 2024, at 23:54, Yu Zhao wrote:
>>
>>> Use __GFP_COMP for gigantic folios to greatly reduce not only the
>>> amount of code but also the allocation and free time.
>>>
>>> LOC (approximately): +60, -240
>>>
>>> Allocate and free 500 1GB hugeTLB memory without HVO by:
>>> time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>> time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> Before After
>>> Alloc ~13s ~10s
>>> Free ~15s <1s
>>>
>>> The above magnitude generally holds for multiple x86 and arm64 CPU
>>> models.
>>>
>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>> Reported-by: Frank van der Linden <fvdl@google.com>
>>> ---
>>> include/linux/hugetlb.h | 9 +-
>>> mm/hugetlb.c | 293 ++++++++--------------------------------
>>> 2 files changed, 62 insertions(+), 240 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 3100a52ceb73..98c47c394b89 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -896,10 +896,11 @@ static inline bool hugepage_movable_supported(struct hstate *h)
>>> /* Movability of hugepages depends on migration support. */
>>> static inline gfp_t htlb_alloc_mask(struct hstate *h)
>>> {
>>> - if (hugepage_movable_supported(h))
>>> - return GFP_HIGHUSER_MOVABLE;
>>> - else
>>> - return GFP_HIGHUSER;
>>> + gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
>>> +
>>> + gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>>> +
>>> + return gfp;
>>> }
>>>
>>> static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 71d469c8e711..efa77ce87dcc 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -56,16 +56,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
>>> #ifdef CONFIG_CMA
>>> static struct cma *hugetlb_cma[MAX_NUMNODES];
>>> static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>> -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
>>> -{
>>> - return cma_pages_valid(hugetlb_cma[folio_nid(folio)], &folio->page,
>>> - 1 << order);
>>> -}
>>> -#else
>>> -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
>>> -{
>>> - return false;
>>> -}
>>> #endif
>>> static unsigned long hugetlb_cma_size __initdata;
>>>
>>> @@ -100,6 +90,17 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>>> unsigned long start, unsigned long end);
>>> static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
>>>
>>> +static void hugetlb_free_folio(struct folio *folio)
>>> +{
>>> +#ifdef CONFIG_CMA
>>> + int nid = folio_nid(folio);
>>> +
>>> + if (cma_free_folio(hugetlb_cma[nid], folio))
>>> + return;
>>> +#endif
>>> + folio_put(folio);
>>> +}
>>> +
>>
>> It seems that we no longer use free_contig_range() to free gigantic
>> folios from alloc_contig_range().
>
> We switched to two pairs of external (to the allocator) APIs in this patch:
> folio_alloc_gigantic()
> folio_put()
> and
> cma_alloc_folio()
> cma_free_folio()
>
>> Will it work? Or did I miss anything?
>
> alloc_contig_range() and free_contig_range() also work with __GFP_COMP
> / large folios, but this pair is internal (to the allocator) and
> shouldn't be used directly except to implement external APIs like
> above.
Oh, I missed the split_large_buddy() addition to free_one_page() in patch 1.
That handles >MAX_ORDER page freeing. It also makes sure the pageblocks
of a free cross-pageblock gigantic hugetlb page go back to the right
free lists when each pageblock might have different migratetypes,
since the page is freed at pageblock granularity.
The whole series looks good to me.
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
@ 2024-08-22 8:21 ` Yu Liao
2024-08-22 17:25 ` Yu Zhao
2024-08-22 18:36 ` Andrew Morton
2024-08-22 17:23 ` Yu Zhao
2024-11-19 15:29 ` David Hildenbrand
2 siblings, 2 replies; 26+ messages in thread
From: Yu Liao @ 2024-08-22 8:21 UTC (permalink / raw)
To: Yu Zhao, Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel,
Kefeng Wang
On 2024/8/14 11:54, Yu Zhao wrote:
> Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
> success the function returns a large folio prepared by
> prep_new_page(), rather than a range of order-0 pages prepared by
> split_free_pages() (which is renamed from split_map_pages()).
>
> alloc_contig_range() can be used to allocate folios larger than
> MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
> free_one_page() needs to handle that by split_large_buddy().
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/gfp.h | 23 +++++++++
> mm/compaction.c | 41 ++--------------
> mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++-------------
> 3 files changed, 108 insertions(+), 67 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f53f76e0b17e..59266df56aeb 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -446,4 +446,27 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
> #endif
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>
> +#ifdef CONFIG_CONTIG_ALLOC
> +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> + int nid, nodemask_t *node)
> +{
> + struct page *page;
> +
> + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
It doesn't seem right, it should be !(gfp & __GFP_COMP).
Best Regards,
Yu
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
2024-08-22 8:21 ` Yu Liao
@ 2024-08-22 17:23 ` Yu Zhao
2024-08-22 18:42 ` Matthew Wilcox
2024-11-19 15:29 ` David Hildenbrand
2 siblings, 1 reply; 26+ messages in thread
From: Yu Zhao @ 2024-08-22 17:23 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel
On Tue, Aug 13, 2024 at 09:54:49PM -0600, Yu Zhao wrote:
> Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
> success the function returns a large folio prepared by
> prep_new_page(), rather than a range of order-0 pages prepared by
> split_free_pages() (which is renamed from split_map_pages()).
>
> alloc_contig_range() can be used to allocate folios larger than
> MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
> free_one_page() needs to handle that by split_large_buddy().
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/gfp.h | 23 +++++++++
> mm/compaction.c | 41 ++--------------
> mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++-------------
> 3 files changed, 108 insertions(+), 67 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f53f76e0b17e..59266df56aeb 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -446,4 +446,27 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
> #endif
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>
> +#ifdef CONFIG_CONTIG_ALLOC
> +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> + int nid, nodemask_t *node)
> +{
> + struct page *page;
> +
> + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
Andrew, could you patch up the line above? This is what it's supposed
to check:
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 59266df56aeb..03ba9563c6db 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -452,7 +452,7 @@ static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
{
struct page *page;
- if (WARN_ON(!order || !(gfp | __GFP_COMP)))
+ if (WARN_ON(!order || !(gfp & __GFP_COMP)))
return NULL;
page = alloc_contig_pages_noprof(1 << order, gfp, nid, node);
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio()
2024-08-14 3:54 ` [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio() Yu Zhao
@ 2024-08-22 17:24 ` Yu Zhao
2024-09-02 17:04 ` Yu Zhao
0 siblings, 1 reply; 26+ messages in thread
From: Yu Zhao @ 2024-08-22 17:24 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel
On Tue, Aug 13, 2024 at 09:54:50PM -0600, Yu Zhao wrote:
> With alloc_contig_range() and free_contig_range() supporting large
> folios, CMA can allocate and free large folios too, by
> cma_alloc_folio() and cma_free_folio().
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/cma.h | 16 +++++++++++++
> mm/cma.c | 55 ++++++++++++++++++++++++++++++++-------------
> 2 files changed, 56 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 9db877506ea8..d15b64f51336 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -52,4 +52,20 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
> extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
>
> extern void cma_reserve_pages_on_error(struct cma *cma);
> +
> +#ifdef CONFIG_CMA
> +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp);
> +bool cma_free_folio(struct cma *cma, const struct folio *folio);
> +#else
> +static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
> +{
> + return NULL;
> +}
> +
> +static inline bool cma_free_folio(struct cma *cma, const struct folio *folio)
> +{
> + return false;
> +}
> +#endif
> +
> #endif
> diff --git a/mm/cma.c b/mm/cma.c
> index 95d6950e177b..4354823d28cf 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma)
> spin_unlock_irq(&cma->lock);
> }
>
> -/**
> - * cma_alloc() - allocate pages from contiguous area
> - * @cma: Contiguous memory region for which the allocation is performed.
> - * @count: Requested number of pages.
> - * @align: Requested alignment of pages (in PAGE_SIZE order).
> - * @no_warn: Avoid printing message about failed allocation
> - *
> - * This function allocates part of contiguous memory on specific
> - * contiguous memory area.
> - */
> -struct page *cma_alloc(struct cma *cma, unsigned long count,
> - unsigned int align, bool no_warn)
> +static struct page *__cma_alloc(struct cma *cma, unsigned long count,
> + unsigned int align, gfp_t gfp)
> {
> unsigned long mask, offset;
> unsigned long pfn = -1;
> @@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>
> pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> mutex_lock(&cma_mutex);
> - ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> - GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> mutex_unlock(&cma_mutex);
> if (ret == 0) {
> page = pfn_to_page(pfn);
> @@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> page_kasan_tag_reset(nth_page(page, i));
> }
>
> - if (ret && !no_warn) {
> + if (ret && !(gfp & __GFP_NOWARN)) {
> pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n",
> __func__, cma->name, count, ret);
> cma_debug_show_areas(cma);
> @@ -513,6 +502,34 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> return page;
> }
>
> +/**
> + * cma_alloc() - allocate pages from contiguous area
> + * @cma: Contiguous memory region for which the allocation is performed.
> + * @count: Requested number of pages.
> + * @align: Requested alignment of pages (in PAGE_SIZE order).
> + * @no_warn: Avoid printing message about failed allocation
> + *
> + * This function allocates part of contiguous memory on specific
> + * contiguous memory area.
> + */
> +struct page *cma_alloc(struct cma *cma, unsigned long count,
> + unsigned int align, bool no_warn)
> +{
> + return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> +}
> +
> +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
> +{
> + struct page *page;
> +
> + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
And here too. Thank you.
diff --git a/mm/cma.c b/mm/cma.c
index 4354823d28cf..2d9fae939283 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -522,7 +522,7 @@ struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
{
struct page *page;
- if (WARN_ON(!order || !(gfp | __GFP_COMP)))
+ if (WARN_ON(!order || !(gfp & __GFP_COMP)))
return NULL;
page = __cma_alloc(cma, 1 << order, order, gfp);
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-22 8:21 ` Yu Liao
@ 2024-08-22 17:25 ` Yu Zhao
2024-08-22 18:36 ` Andrew Morton
1 sibling, 0 replies; 26+ messages in thread
From: Yu Zhao @ 2024-08-22 17:25 UTC (permalink / raw)
To: Yu Liao
Cc: Andrew Morton, Muchun Song, Matthew Wilcox (Oracle), Zi Yan,
linux-mm, linux-kernel, Kefeng Wang
On Thu, Aug 22, 2024 at 2:21 AM Yu Liao <liaoyu15@huawei.com> wrote:
>
> On 2024/8/14 11:54, Yu Zhao wrote:
> > Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
> > success the function returns a large folio prepared by
> > prep_new_page(), rather than a range of order-0 pages prepared by
> > split_free_pages() (which is renamed from split_map_pages()).
> >
> > alloc_contig_range() can be used to allocate folios larger than
> > MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
> > free_one_page() needs to handle that by split_large_buddy().
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > include/linux/gfp.h | 23 +++++++++
> > mm/compaction.c | 41 ++--------------
> > mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++-------------
> > 3 files changed, 108 insertions(+), 67 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index f53f76e0b17e..59266df56aeb 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -446,4 +446,27 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
> > #endif
> > void free_contig_range(unsigned long pfn, unsigned long nr_pages);
> >
> > +#ifdef CONFIG_CONTIG_ALLOC
> > +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> > + int nid, nodemask_t *node)
> > +{
> > + struct page *page;
> > +
> > + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
>
> It doesn't seem right, it should be !(gfp & __GFP_COMP).
Thanks. I've asked Andrew to patch this up (and another place in mm/cma.c).
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-22 8:21 ` Yu Liao
2024-08-22 17:25 ` Yu Zhao
@ 2024-08-22 18:36 ` Andrew Morton
1 sibling, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2024-08-22 18:36 UTC (permalink / raw)
To: Yu Liao
Cc: Yu Zhao, Muchun Song, Matthew Wilcox (Oracle), Zi Yan, linux-mm,
linux-kernel, Kefeng Wang
On Thu, 22 Aug 2024 16:21:27 +0800 Yu Liao <liaoyu15@huawei.com> wrote:
> > +#ifdef CONFIG_CONTIG_ALLOC
> > +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> > + int nid, nodemask_t *node)
> > +{
> > + struct page *page;
> > +
> > + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
>
> It doesn't seem right, it should be !(gfp & __GFP_COMP).
Yup. Although I'm wondering why we're checking __GFP_COMP at all?
It's a gigantic page - we know it's compound, so why require that the
caller tell us something we already know?
--- a/include/linux/gfp.h~mm-contig_alloc-support-__gfp_comp-fix
+++ a/include/linux/gfp.h
@@ -452,7 +452,7 @@ static inline struct folio *folio_alloc_
{
struct page *page;
- if (WARN_ON(!order || !(gfp | __GFP_COMP)))
+ if (WARN_ON(!order || !(gfp & __GFP_COMP)))
return NULL;
page = alloc_contig_pages_noprof(1 << order, gfp, nid, node);
_
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-22 17:23 ` Yu Zhao
@ 2024-08-22 18:42 ` Matthew Wilcox
2024-09-02 17:02 ` Yu Zhao
0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2024-08-22 18:42 UTC (permalink / raw)
To: Yu Zhao; +Cc: Andrew Morton, Muchun Song, Zi Yan, linux-mm, linux-kernel
On Thu, Aug 22, 2024 at 11:23:04AM -0600, Yu Zhao wrote:
> Andrew, could you patch up the line above? This is what it's supposed
> to check:
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 59266df56aeb..03ba9563c6db 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -452,7 +452,7 @@ static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> {
> struct page *page;
>
> - if (WARN_ON(!order || !(gfp | __GFP_COMP)))
> + if (WARN_ON(!order || !(gfp & __GFP_COMP)))
> return NULL;
I don't think we should do this at all. Just this should be enough:
gfp |= __GFP_COMP;
same as folio_alloc() (or now folio_alloc_noprof()).
Do we really care if somebody tries to allocate a gigantic page with an
order of 0? It's weird, but would work, so I don't see the need for the
warning.
> page = alloc_contig_pages_noprof(1 << order, gfp, nid, node);
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios
2024-08-14 3:54 [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios Yu Zhao
` (2 preceding siblings ...)
2024-08-14 3:54 ` [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios Yu Zhao
@ 2024-08-30 17:55 ` jane.chu
3 siblings, 0 replies; 26+ messages in thread
From: jane.chu @ 2024-08-30 17:55 UTC (permalink / raw)
To: Yu Zhao, Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel
On 8/13/2024 8:54 PM, Yu Zhao wrote:
> Use __GFP_COMP for gigantic folios can greatly reduce not only the
> amount of code but also the allocation and free time.
>
> Approximate LOC to mm/hugetlb.c: +60, -240
>
> Allocate and free 500 1GB hugeTLB memory without HVO by:
Do you also have numbers with HVO enabled?
> time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> Before After
> Alloc ~13s ~10s
> Free ~15s <1s
>
Thanks,
-jane
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-22 18:42 ` Matthew Wilcox
@ 2024-09-02 17:02 ` Yu Zhao
0 siblings, 0 replies; 26+ messages in thread
From: Yu Zhao @ 2024-09-02 17:02 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Andrew Morton, Muchun Song, Zi Yan, linux-mm, linux-kernel
On Thu, Aug 22, 2024 at 07:42:20PM +0100, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 11:23:04AM -0600, Yu Zhao wrote:
> > Andrew, could you patch up the line above? This is what it's supposed
> > to check:
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 59266df56aeb..03ba9563c6db 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -452,7 +452,7 @@ static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
> > {
> > struct page *page;
> >
> > - if (WARN_ON(!order || !(gfp | __GFP_COMP)))
> > + if (WARN_ON(!order || !(gfp & __GFP_COMP)))
> > return NULL;
>
> I don't think we should do this at all. Just this should be enough:
>
> gfp |= __GFP_COMP;
>
> same as folio_alloc() (or now folio_alloc_noprof()).
> Do we really care if somebody tries to allocate a gigantic page with an
> order of 0?
If this ever happens, I'd bet it's a bug.
> It's weird, but would work, so I don't see the need for the
> warning.
So the warning could catch that, but if we think it's too verbose, then
please fold the following in:
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a951de920e20..b43934d79dd9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -452,10 +452,7 @@ static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
{
struct page *page;
- if (WARN_ON(!order || !(gfp & __GFP_COMP)))
- return NULL;
-
- page = alloc_contig_pages_noprof(1 << order, gfp, nid, node);
+ page = alloc_contig_pages_noprof(1 << order, gfp | __GFP_COMP, nid, node);
return page ? page_folio(page) : NULL;
}
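For reference, with that folded in, the helper reconstructed from the hunks in this thread would read roughly as below (kernel context assumed; shown only to make the end state easy to see):

/* Reconstructed from the hunks above: __GFP_COMP is forced internally,
 * so the order/flag WARN_ON() is gone and callers need not pass the flag. */
static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp,
                                                         int nid, nodemask_t *node)
{
        struct page *page;

        page = alloc_contig_pages_noprof(1 << order, gfp | __GFP_COMP, nid, node);

        return page ? page_folio(page) : NULL;
}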
* Re: [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio()
2024-08-22 17:24 ` Yu Zhao
@ 2024-09-02 17:04 ` Yu Zhao
0 siblings, 0 replies; 26+ messages in thread
From: Yu Zhao @ 2024-09-02 17:04 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel
On Thu, Aug 22, 2024 at 11:24:14AM -0600, Yu Zhao wrote:
> On Tue, Aug 13, 2024 at 09:54:50PM -0600, Yu Zhao wrote:
> > With alloc_contig_range() and free_contig_range() supporting large
> > folios, CMA can allocate and free large folios too, by
> > cma_alloc_folio() and cma_free_folio().
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > include/linux/cma.h | 16 +++++++++++++
> > mm/cma.c | 55 ++++++++++++++++++++++++++++++++-------------
> > 2 files changed, 56 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/cma.h b/include/linux/cma.h
> > index 9db877506ea8..d15b64f51336 100644
> > --- a/include/linux/cma.h
> > +++ b/include/linux/cma.h
> > @@ -52,4 +52,20 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
> > extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
> >
> > extern void cma_reserve_pages_on_error(struct cma *cma);
> > +
> > +#ifdef CONFIG_CMA
> > +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp);
> > +bool cma_free_folio(struct cma *cma, const struct folio *folio);
> > +#else
> > +static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline bool cma_free_folio(struct cma *cma, const struct folio *folio)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > #endif
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 95d6950e177b..4354823d28cf 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma)
> > spin_unlock_irq(&cma->lock);
> > }
> >
> > -/**
> > - * cma_alloc() - allocate pages from contiguous area
> > - * @cma: Contiguous memory region for which the allocation is performed.
> > - * @count: Requested number of pages.
> > - * @align: Requested alignment of pages (in PAGE_SIZE order).
> > - * @no_warn: Avoid printing message about failed allocation
> > - *
> > - * This function allocates part of contiguous memory on specific
> > - * contiguous memory area.
> > - */
> > -struct page *cma_alloc(struct cma *cma, unsigned long count,
> > - unsigned int align, bool no_warn)
> > +static struct page *__cma_alloc(struct cma *cma, unsigned long count,
> > + unsigned int align, gfp_t gfp)
> > {
> > unsigned long mask, offset;
> > unsigned long pfn = -1;
> > @@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >
> > pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> > mutex_lock(&cma_mutex);
> > - ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> > - GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> > + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> > mutex_unlock(&cma_mutex);
> > if (ret == 0) {
> > page = pfn_to_page(pfn);
> > @@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> > page_kasan_tag_reset(nth_page(page, i));
> > }
> >
> > - if (ret && !no_warn) {
> > + if (ret && !(gfp & __GFP_NOWARN)) {
> > pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n",
> > __func__, cma->name, count, ret);
> > cma_debug_show_areas(cma);
> > @@ -513,6 +502,34 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> > return page;
> > }
> >
> > +/**
> > + * cma_alloc() - allocate pages from contiguous area
> > + * @cma: Contiguous memory region for which the allocation is performed.
> > + * @count: Requested number of pages.
> > + * @align: Requested alignment of pages (in PAGE_SIZE order).
> > + * @no_warn: Avoid printing message about failed allocation
> > + *
> > + * This function allocates part of contiguous memory on specific
> > + * contiguous memory area.
> > + */
> > +struct page *cma_alloc(struct cma *cma, unsigned long count,
> > + unsigned int align, bool no_warn)
> > +{
> > + return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> > +}
> > +
> > +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
> > +{
> > + struct page *page;
> > +
> > + if (WARN_ON(!order || !(gfp | __GFP_COMP)))
>
> And here too. Thank you.
>
> diff --git a/mm/cma.c b/mm/cma.c
> index 4354823d28cf..2d9fae939283 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -522,7 +522,7 @@ struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
> {
> struct page *page;
>
> - if (WARN_ON(!order || !(gfp | __GFP_COMP)))
> + if (WARN_ON(!order || !(gfp & __GFP_COMP)))
> return NULL;
>
> page = __cma_alloc(cma, 1 << order, order, gfp);
And the following instead of the above, if we don't want to warn about
the potential bug.
diff --git a/mm/cma.c b/mm/cma.c
index 2d9fae939283..00c30dcee200 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -522,10 +522,7 @@ struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
{
struct page *page;
- if (WARN_ON(!order || !(gfp & __GFP_COMP)))
- return NULL;
-
- page = __cma_alloc(cma, 1 << order, order, gfp);
+ page = __cma_alloc(cma, 1 << order, order, gfp | __GFP_COMP);
return page ? page_folio(page) : NULL;
}
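The same fold-in leaves cma_alloc_folio() looking roughly like this (reconstructed from the hunks above, kernel context assumed; shown only for reference):

/* Reconstructed from the hunks above: __GFP_COMP is forced internally. */
struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
{
        struct page *page;

        page = __cma_alloc(cma, 1 << order, order, gfp | __GFP_COMP);

        return page ? page_folio(page) : NULL;
}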
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
2024-08-22 8:21 ` Yu Liao
2024-08-22 17:23 ` Yu Zhao
@ 2024-11-19 15:29 ` David Hildenbrand
2024-11-19 16:12 ` Zi Yan
2 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-11-19 15:29 UTC (permalink / raw)
To: Yu Zhao, Andrew Morton, Muchun Song
Cc: Matthew Wilcox (Oracle), Zi Yan, linux-mm, linux-kernel
> +/* Split a multi-block free page into its individual pageblocks. */
> +static void split_large_buddy(struct zone *zone, struct page *page,
> + unsigned long pfn, int order, fpi_t fpi)
> +{
> + unsigned long end = pfn + (1 << order);
> +
> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
> + /* Caller removed page from freelist, buddy info cleared! */
> + VM_WARN_ON_ONCE(PageBuddy(page));
> +
> + if (order > pageblock_order)
> + order = pageblock_order;
> +
> + while (pfn != end) {
> + int mt = get_pfnblock_migratetype(page, pfn);
> +
> + __free_one_page(page, pfn, zone, order, mt, fpi);
> + pfn += 1 << order;
> + page = pfn_to_page(pfn);
> + }
> +}
Hi,
stumbling over this while digging through the code ....
> +
> static void free_one_page(struct zone *zone, struct page *page,
> unsigned long pfn, unsigned int order,
> fpi_t fpi_flags)
> {
> unsigned long flags;
> - int migratetype;
>
> spin_lock_irqsave(&zone->lock, flags);
> - migratetype = get_pfnblock_migratetype(page, pfn);
> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
This change is rather undesired:
via __free_pages_core()->__free_pages_ok() we can easily end up here
with order=MAX_PAGE_ORDER.
What your new code will do is split this perfectly reasonable
MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized
chunks, and let the buddy merging logic undo our unnecessary splitting.
Is there a way to avoid this and just process the whole MAX_PAGE_ORDER
chunk like we used to?
--
Cheers,
David / dhildenb
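To make the concern concrete, here is a small standalone model (plain userspace C, not kernel code; the orders are the common x86-64 values) of the pfn walk in split_large_buddy(): an order-10 free coming from __free_pages_core() is handed to the buddy as two order-9 frees, which the merge logic then has to stitch back together.

#include <stdio.h>

#define MAX_PAGE_ORDER_MODEL   10   /* illustrative values only */
#define PAGEBLOCK_ORDER_MODEL   9

/* Models the pfn walk in split_large_buddy(): chop anything larger than a
 * pageblock into pageblock-sized frees. */
static void model_split_large_buddy(unsigned long pfn, int order)
{
        unsigned long end = pfn + (1UL << order);

        if (order > PAGEBLOCK_ORDER_MODEL)
                order = PAGEBLOCK_ORDER_MODEL;

        while (pfn != end) {
                printf("__free_one_page(pfn=%lu, order=%d)\n", pfn, order);
                pfn += 1UL << order;
        }
}

int main(void)
{
        /* One MAX_PAGE_ORDER chunk freed during boot ends up as two frees. */
        model_split_large_buddy(0, MAX_PAGE_ORDER_MODEL);
        return 0;
}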
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 15:29 ` David Hildenbrand
@ 2024-11-19 16:12 ` Zi Yan
2024-11-19 16:25 ` David Hildenbrand
2024-11-19 16:31 ` David Hildenbrand
0 siblings, 2 replies; 26+ messages in thread
From: Zi Yan @ 2024-11-19 16:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>> +/* Split a multi-block free page into its individual pageblocks. */
>> +static void split_large_buddy(struct zone *zone, struct page *page,
>> + unsigned long pfn, int order, fpi_t fpi)
>> +{
>> + unsigned long end = pfn + (1 << order);
>> +
>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>> + /* Caller removed page from freelist, buddy info cleared! */
>> + VM_WARN_ON_ONCE(PageBuddy(page));
>> +
>> + if (order > pageblock_order)
>> + order = pageblock_order;
>> +
>> + while (pfn != end) {
>> + int mt = get_pfnblock_migratetype(page, pfn);
>> +
>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>> + pfn += 1 << order;
>> + page = pfn_to_page(pfn);
>> + }
>> +}
>
> Hi,
>
> stumbling over this while digging through the code ....
>
>> +
>> static void free_one_page(struct zone *zone, struct page *page,
>> unsigned long pfn, unsigned int order,
>> fpi_t fpi_flags)
>> {
>> unsigned long flags;
>> - int migratetype;
>> spin_lock_irqsave(&zone->lock, flags);
>> - migratetype = get_pfnblock_migratetype(page, pfn);
>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>
> This change is rather undesired:
>
> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
We do not have PMD level mTHP yet. Any other possible source?
>
> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>
> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
Probably split_large_buddy() can check the migratetypes of the to-be-freed
page, if order > pageblock_order. If all migratetypes are the same, the page
can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
Best Regards,
Yan, Zi
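A minimal sketch of that check (helper name and placement invented here, untested, not part of the series) could look like:

/* Hypothetical helper: returns true if every pageblock covered by the range
 * shares the same migratetype, so the whole range could be fed to
 * __free_one_page() at its original order. Assumes pfn is order-aligned and
 * order >= pageblock_order. */
static bool pfnblocks_share_migratetype(unsigned long pfn, int order)
{
        int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
        unsigned long end = pfn + (1UL << order);

        for (pfn += pageblock_nr_pages; pfn != end; pfn += pageblock_nr_pages) {
                if (get_pfnblock_migratetype(pfn_to_page(pfn), pfn) != mt)
                        return false;
        }
        return true;
}

split_large_buddy() could then keep the original order whenever this returns true and only drop to pageblock_order otherwise.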
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:12 ` Zi Yan
@ 2024-11-19 16:25 ` David Hildenbrand
2024-11-19 16:31 ` David Hildenbrand
1 sibling, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2024-11-19 16:25 UTC (permalink / raw)
To: Zi Yan
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19.11.24 17:12, Zi Yan wrote:
> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>
>>> +/* Split a multi-block free page into its individual pageblocks. */
>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>> + unsigned long pfn, int order, fpi_t fpi)
>>> +{
>>> + unsigned long end = pfn + (1 << order);
>>> +
>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>> + /* Caller removed page from freelist, buddy info cleared! */
>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>> +
>>> + if (order > pageblock_order)
>>> + order = pageblock_order;
>>> +
>>> + while (pfn != end) {
>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>> +
>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>> + pfn += 1 << order;
>>> + page = pfn_to_page(pfn);
>>> + }
>>> +}
>>
>> Hi,
>>
>> stumbling over this while digging through the code ....
>>
>>> +
>>> static void free_one_page(struct zone *zone, struct page *page,
>>> unsigned long pfn, unsigned int order,
>>> fpi_t fpi_flags)
>>> {
>>> unsigned long flags;
>>> - int migratetype;
>>> spin_lock_irqsave(&zone->lock, flags);
>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>
>> This change is rather undesired:
>>
>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>
> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
> We do not have PMD level mTHP yet. Any other possible source?
Memory init during boot. See deferred_free_pages() and
__free_pages_memory()->memblock_free_pages().
So this is used for exposing most memory during boot to the buddy in
MAX_PAGE_ORDER granularity.
The other is memory hotplug via generic_online_pages().
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:12 ` Zi Yan
2024-11-19 16:25 ` David Hildenbrand
@ 2024-11-19 16:31 ` David Hildenbrand
2024-11-19 16:41 ` Zi Yan
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-11-19 16:31 UTC (permalink / raw)
To: Zi Yan
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19.11.24 17:12, Zi Yan wrote:
> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>
>>> +/* Split a multi-block free page into its individual pageblocks. */
>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>> + unsigned long pfn, int order, fpi_t fpi)
>>> +{
>>> + unsigned long end = pfn + (1 << order);
>>> +
>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>> + /* Caller removed page from freelist, buddy info cleared! */
>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>> +
>>> + if (order > pageblock_order)
>>> + order = pageblock_order;
>>> +
>>> + while (pfn != end) {
>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>> +
>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>> + pfn += 1 << order;
>>> + page = pfn_to_page(pfn);
>>> + }
>>> +}
>>
>> Hi,
>>
>> stumbling over this while digging through the code ....
>>
>>> +
>>> static void free_one_page(struct zone *zone, struct page *page,
>>> unsigned long pfn, unsigned int order,
>>> fpi_t fpi_flags)
>>> {
>>> unsigned long flags;
>>> - int migratetype;
>>> spin_lock_irqsave(&zone->lock, flags);
>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>
>> This change is rather undesired:
>>
>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>
> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
> We do not have PMD level mTHP yet. Any other possible source?
>
>>
>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>
>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>
> Probably split_large_buddy() can check the migratetypes of the to-be-freed
> page, if order > pageblock_order. If all migratetypes are the same, the page
> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
Thinking about this: why do we care about the migratetype?
We only have to fall back to pageblocks if any pageblock is
"MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we
can just ignore the migratetype (or rather overwrite it).
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:31 ` David Hildenbrand
@ 2024-11-19 16:41 ` Zi Yan
2024-11-19 16:49 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2024-11-19 16:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
> On 19.11.24 17:12, Zi Yan wrote:
>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>
>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>> +{
>>>> + unsigned long end = pfn + (1 << order);
>>>> +
>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>> +
>>>> + if (order > pageblock_order)
>>>> + order = pageblock_order;
>>>> +
>>>> + while (pfn != end) {
>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>> +
>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>> + pfn += 1 << order;
>>>> + page = pfn_to_page(pfn);
>>>> + }
>>>> +}
>>>
>>> Hi,
>>>
>>> stumbling over this while digging through the code ....
>>>
>>>> +
>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>> unsigned long pfn, unsigned int order,
>>>> fpi_t fpi_flags)
>>>> {
>>>> unsigned long flags;
>>>> - int migratetype;
>>>> spin_lock_irqsave(&zone->lock, flags);
>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>
>>> This change is rather undesired:
>>>
>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>
>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>> We do not have PMD level mTHP yet. Any other possible source?
>>
>>>
>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>
>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>
>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>> page, if order > pageblock_order. If all migratetypes are the same, the page
>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>
> Thinking about this: why do we care about the migratetype?
>
> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
There are VM_WARN_ONCEs around *_free_list() operations to make sure
page migratetype matches the migratetype of the list it is on. Ignoring
migratetype would trigger these WARNs. Overwriting it would work but
free page migratetype accounting needs to be taken care of.
An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
and gigantic hugetlb folios are freed via free_one_page(), where
split_large_buddy() is used to work around the limitation.
For the two memory init cases you mentioned in the other email, maybe a new
fpi flag to make free_one_page() use __free_one_page() for them,
since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
Best Regards,
Yan, Zi
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:41 ` Zi Yan
@ 2024-11-19 16:49 ` David Hildenbrand
2024-11-19 16:52 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-11-19 16:49 UTC (permalink / raw)
To: Zi Yan
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19.11.24 17:41, Zi Yan wrote:
> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>
>> On 19.11.24 17:12, Zi Yan wrote:
>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>
>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>> +{
>>>>> + unsigned long end = pfn + (1 << order);
>>>>> +
>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>> +
>>>>> + if (order > pageblock_order)
>>>>> + order = pageblock_order;
>>>>> +
>>>>> + while (pfn != end) {
>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>> +
>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>> + pfn += 1 << order;
>>>>> + page = pfn_to_page(pfn);
>>>>> + }
>>>>> +}
>>>>
>>>> Hi,
>>>>
>>>> stumbling over this while digging through the code ....
>>>>
>>>>> +
>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>> unsigned long pfn, unsigned int order,
>>>>> fpi_t fpi_flags)
>>>>> {
>>>>> unsigned long flags;
>>>>> - int migratetype;
>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>
>>>> This change is rather undesired:
>>>>
>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>
>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>> We do not have PMD level mTHP yet. Any other possible source?
>>>
>>>>
>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>
>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>
>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>
>> Thinking about this: why do we care about the migratetype?
>>
>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>
> There are VM_WARN_ONCEs around *_free_list() operations to make sure
> page migratetype matches the migratetype of the list it is on. Ignoring
> migratetype would trigger these WARNs. Overwriting it would work but
> free page migratetype accounting needs to be taken care of.
>
> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
> and gigantic hugetlb folios are freed via free_one_page(), where
> split_large_buddy() is used to work around the limitation.
Yes, I saw that change. But that case can be easily identified by
unlikely(order > MAX_PAGE_ORDER) and handled in a special way.
> For the two memory init cases you mentioned in the other email, maybe a new
> fpi flag to make free_one_page() use __free_one_page() for them,
> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
In the context of alloc_frozen_range()/free_frozen_range() I want to
free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
the freeing path undo some of that effort.
In the common case, the pageblocks will all have the same migratetype when
freeing a MAX_PAGE_ORDER page, so we should identify the odd case and
only special-case that.
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:49 ` David Hildenbrand
@ 2024-11-19 16:52 ` David Hildenbrand
2024-11-20 15:55 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-11-19 16:52 UTC (permalink / raw)
To: Zi Yan
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19.11.24 17:49, David Hildenbrand wrote:
> On 19.11.24 17:41, Zi Yan wrote:
>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>
>>> On 19.11.24 17:12, Zi Yan wrote:
>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>
>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>>> +{
>>>>>> + unsigned long end = pfn + (1 << order);
>>>>>> +
>>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>> +
>>>>>> + if (order > pageblock_order)
>>>>>> + order = pageblock_order;
>>>>>> +
>>>>>> + while (pfn != end) {
>>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>>> +
>>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>> + pfn += 1 << order;
>>>>>> + page = pfn_to_page(pfn);
>>>>>> + }
>>>>>> +}
>>>>>
>>>>> Hi,
>>>>>
>>>>> stumbling over this while digging through the code ....
>>>>>
>>>>>> +
>>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>>> unsigned long pfn, unsigned int order,
>>>>>> fpi_t fpi_flags)
>>>>>> {
>>>>>> unsigned long flags;
>>>>>> - int migratetype;
>>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>
>>>>> This change is rather undesired:
>>>>>
>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>
>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>
>>>>>
>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>
>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>
>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>
>>> Thinking about this: why do we care about the migratetype?
>>>
>>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>>
>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>> page migratetype matches the migratetype of the list it is on. Ignoring
>> migratetype would trigger these WARNs. Overwriting it would work but
>> free page migratetype accounting needs to be taken care of.
>>
>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>> and gigantic hugetlb folios are freed via free_one_page(), where
>> split_large_buddy() is used to work around the limitation.
>
> Yes, I saw that change. But that can be easily identified cased by
> unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.
>
>> For the two memory init cases you mentioned in the other email, maybe a new
>> fpi flag to make free_one_page() use __free_one_page() for them,
>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>
> In the context of alloc_frozen_range()/free_frozen_range() I want to
> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
> the freeing path undo some of that effort.
Adding a pointer to that discussion:
https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-19 16:52 ` David Hildenbrand
@ 2024-11-20 15:55 ` Zi Yan
2024-11-20 16:13 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2024-11-20 15:55 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
> On 19.11.24 17:49, David Hildenbrand wrote:
>> On 19.11.24 17:41, Zi Yan wrote:
>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>
>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>
>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>>>> +{
>>>>>>> + unsigned long end = pfn + (1 << order);
>>>>>>> +
>>>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>> +
>>>>>>> + if (order > pageblock_order)
>>>>>>> + order = pageblock_order;
>>>>>>> +
>>>>>>> + while (pfn != end) {
>>>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>> +
>>>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>> + pfn += 1 << order;
>>>>>>> + page = pfn_to_page(pfn);
>>>>>>> + }
>>>>>>> +}
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> stumbling over this while digging through the code ....
>>>>>>
>>>>>>> +
>>>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>>>> unsigned long pfn, unsigned int order,
>>>>>>> fpi_t fpi_flags)
>>>>>>> {
>>>>>>> unsigned long flags;
>>>>>>> - int migratetype;
>>>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>
>>>>>> This change is rather undesired:
>>>>>>
>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>
>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>>
>>>>>>
>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>
>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>
>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>
>>>> Thinking about this: why do we care about the migratetype?
>>>>
>>>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>>>
>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>> migratetype would trigger these WARNs. Overwriting it would work but
>>> free page migratetype accounting needs to be taken care of.
>>>
>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>> split_large_buddy() is used to work around the limitation.
>>
>> Yes, I saw that change. But that can be easily identified cased by
>> unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.
>>
>>> For the two memory init cases you mentioned in the other email, maybe a new
>>> fpi flag to make free_one_page() use __free_one_page() for them,
>>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>>
>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>> the freeing path undo some of that effort.
>
> Adding a pointer to that discussion:
>
> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
Thanks.
So you are thinking about something like this:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..3d3341dc1ad1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
unsigned long flags;
spin_lock_irqsave(&zone->lock, flags);
- split_large_buddy(zone, page, pfn, order, fpi_flags);
+ if (unlikely(order > MAX_PAGE_ORDER))
+ split_large_buddy(zone, page, pfn, order, fpi_flags);
+ else {
+ int migratetype = get_pfnblock_migratetype(page, pfn);
+ __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+ }
spin_unlock_irqrestore(&zone->lock, flags);
__count_vm_events(PGFREE, 1 << order);
Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.
If yes, alloc_contig_range() can change the migratetype of one half of that folio,
and during the migration phase that folio will be freed via __free_one_page(),
causing a migratetype mismatch.
Best Regards,
Yan, Zi
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-20 15:55 ` Zi Yan
@ 2024-11-20 16:13 ` David Hildenbrand
2024-11-20 16:46 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:13 UTC (permalink / raw)
To: Zi Yan
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 20.11.24 16:55, Zi Yan wrote:
> On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
>
>> On 19.11.24 17:49, David Hildenbrand wrote:
>>> On 19.11.24 17:41, Zi Yan wrote:
>>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>>
>>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>>
>>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>>>>> +{
>>>>>>>> + unsigned long end = pfn + (1 << order);
>>>>>>>> +
>>>>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>>> +
>>>>>>>> + if (order > pageblock_order)
>>>>>>>> + order = pageblock_order;
>>>>>>>> +
>>>>>>>> + while (pfn != end) {
>>>>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>>> +
>>>>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>>> + pfn += 1 << order;
>>>>>>>> + page = pfn_to_page(pfn);
>>>>>>>> + }
>>>>>>>> +}
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> stumbling over this while digging through the code ....
>>>>>>>
>>>>>>>> +
>>>>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>>>>> unsigned long pfn, unsigned int order,
>>>>>>>> fpi_t fpi_flags)
>>>>>>>> {
>>>>>>>> unsigned long flags;
>>>>>>>> - int migratetype;
>>>>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>>
>>>>>>> This change is rather undesired:
>>>>>>>
>>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>>
>>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>>>
>>>>>>>
>>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>>
>>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>>
>>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>>
>>>>> Thinking about this: why do we care about the migratetype?
>>>>>
>>>>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>>>>
>>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>>> migratetype would trigger these WARNs. Overwriting it would work but
>>>> free page migratetype accounting needs to be taken care of.
>>>>
>>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>>> split_large_buddy() is used to work around the limitation.
>>>
>>> Yes, I saw that change. But that can be easily identified cased by
>>> unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.
>>>
>>>> For the two memory init cases you mentioned in the other email, maybe a new
>>>> fpi flag to make free_one_page() use __free_one_page() for them,
>>>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>>>
>>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>>> the freeing path undo some of that effort.
>>
>> Adding a pointer to that discussion:
>>
>> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
>
> Thanks.
>
> So you are thinking about something like this:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..3d3341dc1ad1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
> unsigned long flags;
>
> spin_lock_irqsave(&zone->lock, flags);
> - split_large_buddy(zone, page, pfn, order, fpi_flags);
> + if (unlikely(order > MAX_PAGE_ORDER))
> + split_large_buddy(zone, page, pfn, order, fpi_flags);
> + else {
> + int migratetype = get_pfnblock_migratetype(page, pfn);
> + __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
> + }
> spin_unlock_irqrestore(&zone->lock, flags);
>
> __count_vm_events(PGFREE, 1 << order);
>
Something in that direction, but likely something like:
if (likely(order <= pageblock_order)) {
int migratetype = get_pfnblock_migratetype(page, pfn);
__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
} else if (order <= MAX_PAGE_ORDER) {
/* We might have to split when some pageblocks differ. */
maybe_split_large_buddy(zone, page, pfn, order, fpi_flags);
} else {
/* The buddy works in max MAX_PAGE_ORDER chunks. */
for /* each MAX_PAGE_ORDER chunk */ {
maybe_split_large_buddy(zone, page, pfn, MAX_PAGE_ORDER, fpi_flags);
pfn += MAX_ORDER_NR_PAGES;
page = ...
}
}
Whereby maybe_split_large_buddy() would check if it has to do anything with the
pageblocks, or whether it can just call __free_one_page(order). It would only
accept order <= MAX_PAGE_ORDER.
In the future with free_frozen_pages(), the last else case would vanish.
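Fleshed out under those assumptions (the name, the isolate/CMA test, and the exact structure are guesses here, not something posted in the series), maybe_split_large_buddy() might look like:

/* Hypothetical sketch: assumes the caller only gets here with
 * pageblock_order < order <= MAX_PAGE_ORDER, as in the branching above.
 * Free the whole chunk at its order unless a covered pageblock needs
 * special treatment (isolated, or CMA), in which case split. */
static void maybe_split_large_buddy(struct zone *zone, struct page *page,
                                    unsigned long pfn, int order, fpi_t fpi)
{
        unsigned long end = pfn + (1UL << order);
        unsigned long cur;

        for (cur = pfn; cur != end; cur += pageblock_nr_pages) {
                int mt = get_pfnblock_migratetype(pfn_to_page(cur), cur);

                if (is_migrate_isolate(mt) || is_migrate_cma(mt))
                        goto split;
        }

        /* All covered pageblocks are ordinary: keep the chunk whole. */
        __free_one_page(page, pfn, zone, order,
                        get_pfnblock_migratetype(page, pfn), fpi);
        return;

split:
        for (; pfn != end; pfn += pageblock_nr_pages) {
                page = pfn_to_page(pfn);
                __free_one_page(page, pfn, zone, pageblock_order,
                                get_pfnblock_migratetype(page, pfn), fpi);
        }
}

The split path mirrors today's split_large_buddy(); the fast path is what keeps boot-time MAX_PAGE_ORDER frees whole.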
>
> Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.
Hm, I wouldn't rule it out. To be future proof, we should likely add a
way to handle it. Shouldn't be too hard or expensive: simply read all
pageblock migratetypes.
I wonder if there are configs (no page isolation?) where we don't have to
check anything.
> If yes, alloc_contig_range() can change the migratetype of one of half that folio
> and during migration phase, that folio will be freed via __free_one_page()
> and causing migratetype mismatch.
Can you remind me how that happens?
Likely, we isolate a single pageblock, and shortly after we free a larger
page that covers multiple pageblocks? So this could affect other
alloc_contig_range() users as well, I assume?
Thanks!
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
2024-11-20 16:13 ` David Hildenbrand
@ 2024-11-20 16:46 ` Zi Yan
0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2024-11-20 16:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yu Zhao, Andrew Morton, Muchun Song, Matthew Wilcox (Oracle),
linux-mm, linux-kernel
On 20 Nov 2024, at 11:13, David Hildenbrand wrote:
> On 20.11.24 16:55, Zi Yan wrote:
>> On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
>>
>>> On 19.11.24 17:49, David Hildenbrand wrote:
>>>> On 19.11.24 17:41, Zi Yan wrote:
>>>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>>>
>>>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>>>
>>>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>>>>>> +{
>>>>>>>>> + unsigned long end = pfn + (1 << order);
>>>>>>>>> +
>>>>>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>>>> +
>>>>>>>>> + if (order > pageblock_order)
>>>>>>>>> + order = pageblock_order;
>>>>>>>>> +
>>>>>>>>> + while (pfn != end) {
>>>>>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>>>> +
>>>>>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>>>> + pfn += 1 << order;
>>>>>>>>> + page = pfn_to_page(pfn);
>>>>>>>>> + }
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> stumbling over this while digging through the code ....
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>>>>>> unsigned long pfn, unsigned int order,
>>>>>>>>> fpi_t fpi_flags)
>>>>>>>>> {
>>>>>>>>> unsigned long flags;
>>>>>>>>> - int migratetype;
>>>>>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>>>
>>>>>>>> This change is rather undesired:
>>>>>>>>
>>>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>>>
>>>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>>>>
>>>>>>>>
>>>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>>>
>>>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>>>
>>>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>>>
>>>>>> Thinking about this: why do we care about the migratetype?
>>>>>>
>>>>>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>>>>>
>>>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>>>> migratetype would trigger these WARNs. Overwriting it would work but
>>>>> free page migratetype accounting needs to be taken care of.
>>>>>
>>>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>>>> split_large_buddy() is used to work around the limitation.
>>>>
>>>> Yes, I saw that change. But that can be easily identified cased by
>>>> unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.
>>>>
>>>>> For the two memory init cases you mentioned in the other email, maybe a new
>>>>> fpi flag to make free_one_page() use __free_one_page() for them,
>>>>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>>>>
>>>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>>>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>>>> the freeing path undo some of that effort.
>>>
>>> Adding a pointer to that discussion:
>>>
>>> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
>>
>> Thanks.
>>
>> So you are thinking about something like this:
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b6958333054d..3d3341dc1ad1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
>> unsigned long flags;
>>
>> spin_lock_irqsave(&zone->lock, flags);
>> - split_large_buddy(zone, page, pfn, order, fpi_flags);
>> + if (unlikely(order > MAX_PAGE_ORDER))
>> + split_large_buddy(zone, page, pfn, order, fpi_flags);
>> + else {
>> + int migratetype = get_pfnblock_migratetype(page, pfn);
>> + __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>> + }
>> spin_unlock_irqrestore(&zone->lock, flags);
>>
>> __count_vm_events(PGFREE, 1 << order);
>>
>
> Something in that direction, but likely something like:
>
> if (likely(order <= pageblock_order)) {
> int migratetype = get_pfnblock_migratetype(page, pfn);
> __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
> } else if (order <= MAX_PAGE_ORDER) {
> /* We might have to split when some pageblocks differ. */
> maybe_split_large_buddy(zone, page, pfn, order, fpi_flags);
> } else {
> /* The buddy works in max MAX_PAGE_ORDER chunks. */
> for /* each MAX_PAGE_ORDER chunk */
> maybe_split_large_buddy(zone, page, pfn, MAX_PAGE_ORDER, fpi_flags);
> pfn += MAX_ORDER_NR_PAGES;
> page = ...
> }
> }
>
> Whereby maybe_split_large_buddy would check if it has to do anything with the
> pageblocks, or whether it can just call __free_one_page(order). It would only
> accept order <=
>
> In the future with free_frozen_pages(), the last else case would vanish.
>
>>
>> Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.
>
> Hm, I wouldn't rule it out. To be future proof, we'd likely should add a
> way to handle it. Shouldn't be too hard and expensive: simply read all
> pageblock migratetypes.
>
> I wonder if there are configs (no page isolation?) where we don't have to
> check anything.
CONFIG_CONTIG_ALLOC? CONFIG_MEMORY_ISOLATION seems too much.
>
>> If yes, alloc_contig_range() can change the migratetype of one of half that folio
>> and during migration phase, that folio will be freed via __free_one_page()
>> and causing migratetype mismatch.
>
> Can you remind me how that happens?
There is a MAX_PAGE_ORDER hugetlb folio across two pageblocks and
alloc_contig_range() wants to get a range that overlaps with the first
pageblock of that hugetlb folio. All pageblocks in the range are marked as
isolated first during start_isolate_page_range(), then
__alloc_contig_migrate_range() isolates and migrates the hugetlb folio.
During migration, the original hugetlb folio is freed; since it is
MAX_PAGE_ORDER, with my code above it would be freed via a single
__free_one_page() call even though its two pageblocks now have different
migratetypes.
But with your maybe_split_large_buddy(), it would work.
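As a plain illustration of that mismatch (standalone C, not kernel code; two pageblocks per MAX_PAGE_ORDER folio as on x86-64, toy migratetype names):

#include <stdio.h>

enum mt { MOVABLE, ISOLATE };   /* toy stand-ins for the kernel migratetypes */

static const char *name(enum mt m)
{
        return m == ISOLATE ? "isolate" : "movable";
}

int main(void)
{
        /* Only the folio's first pageblock was isolated by alloc_contig_range(). */
        enum mt pageblock_mt[2] = { ISOLATE, MOVABLE };

        /* Freeing the folio as one buddy uses a single type for both blocks. */
        printf("one order-10 free : both pageblocks treated as %s\n",
               name(pageblock_mt[0]));

        /* A per-pageblock walk respects what each pageblock actually is. */
        for (int i = 0; i < 2; i++)
                printf("per-pageblock free: pageblock %d treated as %s\n",
                       i, name(pageblock_mt[i]));
        return 0;
}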
>
> Likely, we isolate a single pageblock, and shortly after we free a larger
> page that covers multiple pageblocks? So this could affect other
> alloc_contig_range() users as well, I assume?
Best Regards,
Yan, Zi
Thread overview: 26+ messages
2024-08-14 3:54 [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free gigantic folios Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP Yu Zhao
2024-08-22 8:21 ` Yu Liao
2024-08-22 17:25 ` Yu Zhao
2024-08-22 18:36 ` Andrew Morton
2024-08-22 17:23 ` Yu Zhao
2024-08-22 18:42 ` Matthew Wilcox
2024-09-02 17:02 ` Yu Zhao
2024-11-19 15:29 ` David Hildenbrand
2024-11-19 16:12 ` Zi Yan
2024-11-19 16:25 ` David Hildenbrand
2024-11-19 16:31 ` David Hildenbrand
2024-11-19 16:41 ` Zi Yan
2024-11-19 16:49 ` David Hildenbrand
2024-11-19 16:52 ` David Hildenbrand
2024-11-20 15:55 ` Zi Yan
2024-11-20 16:13 ` David Hildenbrand
2024-11-20 16:46 ` Zi Yan
2024-08-14 3:54 ` [PATCH mm-unstable v2 2/3] mm/cma: add cma_{alloc,free}_folio() Yu Zhao
2024-08-22 17:24 ` Yu Zhao
2024-09-02 17:04 ` Yu Zhao
2024-08-14 3:54 ` [PATCH mm-unstable v2 3/3] mm/hugetlb: use __GFP_COMP for gigantic folios Yu Zhao
2024-08-16 15:23 ` Zi Yan
2024-08-16 16:02 ` Yu Zhao
2024-08-16 16:30 ` Zi Yan
2024-08-30 17:55 ` [PATCH mm-unstable v2 0/3] mm/hugetlb: alloc/free " jane.chu