* [PATCH 00/12] Batch hugetlb vmemmap modification operations
@ 2023-08-25 19:04 Mike Kravetz
2023-08-25 19:04 ` [PATCH 01/12] hugetlb: clear flags in tail pages that will be freed individually Mike Kravetz
` (11 more replies)
0 siblings, 12 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower. Such overhead was deemed
an acceptable trade-off for the memory savings obtained by freeing vmemmap
pages.
It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x! Some actual numbers from other environments are:
Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s
VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s
In the VM environment, enabling hugetlb vmemmap optimization resulted in
allocation times that were 61x slower.
A quick profile showed that the vast majority of this overhead is due to
TLB flushing. Each time the kernel pagetable is modified, the TLB must be
flushed. For each hugetlb page that is optimized, up to two TLB flushes may
be performed: one for the vmemmap pages associated with the hugetlb page,
and potentially another if the vmemmap pages are mapped at the PMD level
and must be split. The TLB flushes required for the kernel pagetable result
in a broadcast IPI, with each CPU having to flush a range of pages or do a
global flush if a threshold is exceeded. So, the flush time increases with
the number of CPUs. In addition, in virtual environments the broadcast IPI
can't be accelerated by hypervisor hardware and leads to traps that need to
wake up/IPI all vCPUs, which is very expensive. Because of this, the
slowdown in virtual environments is even worse than on bare metal as the
number of vCPUs/CPUs increases.
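For illustration, here is a rough sketch of the per-page flow described
above, contrasted with the batched flow this series moves to. This is
pseudocode, not the exact kernel code; flush_tlb_kernel_range() and
TLB_FLUSH_ALL are the real interfaces involved.

	/* today: up to two kernel TLB flushes per optimized hugetlb page */
	for each hugetlb page being optimized:
		split vmemmap PMD if needed
			flush_tlb_kernel_range(addr, addr + PMD_SIZE)	/* flush 1 */
		remap tail vmemmap pages to the head vmemmap page
			flush_tlb_kernel_range(start, end)		/* flush 2 */

	/* with batching: two kernel TLB flushes per batch of hugetlb pages */
	for each hugetlb page in the batch:
		split vmemmap PMD if needed			/* no flush */
	flush_tlb_kernel_range(0, TLB_FLUSH_ALL)		/* flush 1 */
	for each hugetlb page in the batch:
		remap tail vmemmap pages			/* no flush */
	flush_tlb_kernel_range(0, TLB_FLUSH_ALL)		/* flush 2 */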
The following series attempts to reduce the amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages. Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages. One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages. Results of such batching are as follows:
Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s
VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s
With batching, results are back in the 2-3x slowdown range.
This series is based on next-20230824.
The first 4 patches of the series are changes currently going into the mm
tree that modify the same area. They are not directly related to the
batching changes.
Patch 5 (hugetlb: restructure pool allocations) is where batching changes
begin.
[1] https://github.com/torvalds/linux/commit/426e5c429d16e4cd5ded46e21ff8e939bf8abd0f
Joao Martins (2):
hugetlb: batch PMD split for bulk vmemmap dedup
hugetlb: batch TLB flushes when freeing vmemmap
Matthew Wilcox (Oracle) (3):
hugetlb: Use a folio in free_hpage_workfn()
hugetlb: Remove a few calls to page_folio()
hugetlb: Convert remove_pool_huge_page() to
remove_pool_hugetlb_folio()
Mike Kravetz (7):
hugetlb: clear flags in tail pages that will be freed individually
hugetlb: restructure pool allocations
hugetlb: perform vmemmap optimization on a list of pages
hugetlb: perform vmemmap restoration on a list of pages
hugetlb: batch freeing of vmemmap pages
hugetlb_vmemmap: Optimistically set Optimized flag
hugetlb: batch TLB flushes when restoring vmemmap
mm/hugetlb.c | 238 +++++++++++++++++++++++++++++--------------
mm/hugetlb_vmemmap.c | 197 ++++++++++++++++++++++++++++++-----
mm/hugetlb_vmemmap.h | 11 ++
3 files changed, 343 insertions(+), 103 deletions(-)
--
2.41.0
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 01/12] hugetlb: clear flags in tail pages that will be freed individually
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 02/12] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
` (10 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
Clear PAGE_FLAGS_CHECK_AT_FREE in the tail pages of a compound gigantic
folio as the folio is destroyed, instead of clearing a hard coded set of
flags on every subpage in __update_and_free_hugetlb_folio. These tail
pages are freed individually, so their flags must be cleared before they
are returned to the low level allocators.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bf70f27c1ba..ba6d39b71cb1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1489,6 +1489,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
for (i = 1; i < nr_pages; i++) {
p = folio_page(folio, i);
+ p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
p->mapping = NULL;
clear_compound_head(p);
if (!demote)
@@ -1707,8 +1708,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
static void __update_and_free_hugetlb_folio(struct hstate *h,
struct folio *folio)
{
- int i;
- struct page *subpage;
bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
@@ -1750,14 +1749,6 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
spin_unlock_irq(&hugetlb_lock);
}
- for (i = 0; i < pages_per_huge_page(h); i++) {
- subpage = folio_page(folio, i);
- subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
- 1 << PG_referenced | 1 << PG_dirty |
- 1 << PG_active | 1 << PG_private |
- 1 << PG_writeback);
- }
-
/*
* Non-gigantic pages demoted from CMA allocated gigantic pages
* need to be given back to CMA in free_gigantic_folio.
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 02/12] hugetlb: Use a folio in free_hpage_workfn()
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
2023-08-25 19:04 ` [PATCH 01/12] hugetlb: clear flags in tail pages that will be freed individually Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 03/12] hugetlb: Remove a few calls to page_folio() Mike Kravetz
` (9 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz, Sidhartha Kumar
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
update_and_free_hugetlb_folio puts the memory on hpage_freelist as a folio
so we can take it off the list as a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
mm/hugetlb.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba6d39b71cb1..1a48a83846cb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1782,22 +1782,22 @@ static void free_hpage_workfn(struct work_struct *work)
node = llist_del_all(&hpage_freelist);
while (node) {
- struct page *page;
+ struct folio *folio;
struct hstate *h;
- page = container_of((struct address_space **)node,
- struct page, mapping);
+ folio = container_of((struct address_space **)node,
+ struct folio, mapping);
node = node->next;
- page->mapping = NULL;
+ folio->mapping = NULL;
/*
* The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in
* folio_hstate() is going to trigger because a previous call to
* remove_hugetlb_folio() will clear the hugetlb bit, so do
* not use folio_hstate() directly.
*/
- h = size_to_hstate(page_size(page));
+ h = size_to_hstate(folio_size(folio));
- __update_and_free_hugetlb_folio(h, page_folio(page));
+ __update_and_free_hugetlb_folio(h, folio);
cond_resched();
}
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 03/12] hugetlb: Remove a few calls to page_folio()
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
2023-08-25 19:04 ` [PATCH 01/12] hugetlb: clear flags in tail pages that will be freed individually Mike Kravetz
2023-08-25 19:04 ` [PATCH 02/12] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 04/12] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
` (8 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz, Sidhartha Kumar
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Anything found on a linked list threaded through ->lru is guaranteed to
be a folio as the compound_head found in a tail page overlaps the ->lru
member of struct page. So we can pull folios directly off these lists
no matter whether pages or folios were added to the list.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
mm/hugetlb.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a48a83846cb..a5348dfada89 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1831,11 +1831,9 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
- struct page *page, *t_page;
- struct folio *folio;
+ struct folio *folio, *t_folio;
- list_for_each_entry_safe(page, t_page, list, lru) {
- folio = page_folio(page);
+ list_for_each_entry_safe(folio, t_folio, list, lru) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
}
@@ -2224,8 +2222,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
bool acct_surplus)
{
int nr_nodes, node;
- struct page *page = NULL;
- struct folio *folio;
+ struct folio *folio = NULL;
lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
@@ -2235,15 +2232,14 @@ static struct page *remove_pool_huge_page(struct hstate *h,
*/
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(&h->hugepage_freelists[node])) {
- page = list_entry(h->hugepage_freelists[node].next,
- struct page, lru);
- folio = page_folio(page);
+ folio = list_entry(h->hugepage_freelists[node].next,
+ struct folio, lru);
remove_hugetlb_folio(h, folio, acct_surplus);
break;
}
}
- return page;
+ return &folio->page;
}
/*
@@ -3359,15 +3355,15 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
* Collect pages to be freed on a list, and free after dropping lock
*/
for_each_node_mask(i, *nodes_allowed) {
- struct page *page, *next;
+ struct folio *folio, *next;
struct list_head *freel = &h->hugepage_freelists[i];
- list_for_each_entry_safe(page, next, freel, lru) {
+ list_for_each_entry_safe(folio, next, freel, lru) {
if (count >= h->nr_huge_pages)
goto out;
- if (PageHighMem(page))
+ if (folio_test_highmem(folio))
continue;
- remove_hugetlb_folio(h, page_folio(page), false);
- list_add(&page->lru, &page_list);
+ remove_hugetlb_folio(h, folio, false);
+ list_add(&folio->lru, &page_list);
}
}
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 04/12] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio()
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (2 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 03/12] hugetlb: Remove a few calls to page_folio() Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 05/12] hugetlb: restructure pool allocations Mike Kravetz
` (7 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz, Sidhartha Kumar
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Convert the callers to expect a folio and remove the unnecessary conversion
back to a struct page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
mm/hugetlb.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a5348dfada89..ec10e3804060 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1446,7 +1446,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
}
/*
- * helper for remove_pool_huge_page() - return the previously saved
+ * helper for remove_pool_hugetlb_folio() - return the previously saved
* node ["this node"] from which to free a huge page. Advance the
* next node id whether or not we find a free huge page to free so
* that the next attempt to free addresses the next node.
@@ -2217,9 +2217,8 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
* an additional call to free the page to low level allocators.
* Called with hugetlb_lock locked.
*/
-static struct page *remove_pool_huge_page(struct hstate *h,
- nodemask_t *nodes_allowed,
- bool acct_surplus)
+static struct folio *remove_pool_hugetlb_folio(struct hstate *h,
+ nodemask_t *nodes_allowed, bool acct_surplus)
{
int nr_nodes, node;
struct folio *folio = NULL;
@@ -2239,7 +2238,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
}
}
- return &folio->page;
+ return folio;
}
/*
@@ -2593,7 +2592,6 @@ static void return_unused_surplus_pages(struct hstate *h,
unsigned long unused_resv_pages)
{
unsigned long nr_pages;
- struct page *page;
LIST_HEAD(page_list);
lockdep_assert_held(&hugetlb_lock);
@@ -2614,15 +2612,17 @@ static void return_unused_surplus_pages(struct hstate *h,
* evenly across all nodes with memory. Iterate across these nodes
* until we can no longer free unreserved surplus pages. This occurs
* when the nodes with surplus pages have no free pages.
- * remove_pool_huge_page() will balance the freed pages across the
+ * remove_pool_hugetlb_folio() will balance the freed pages across the
* on-line nodes with memory and will handle the hstate accounting.
*/
while (nr_pages--) {
- page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
- if (!page)
+ struct folio *folio;
+
+ folio = remove_pool_hugetlb_folio(h, &node_states[N_MEMORY], 1);
+ if (!folio)
goto out;
- list_add(&page->lru, &page_list);
+ list_add(&folio->lru, &page_list);
}
out:
@@ -3417,7 +3417,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
unsigned long min_count, ret;
- struct page *page;
LIST_HEAD(page_list);
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
@@ -3537,11 +3536,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
- page = remove_pool_huge_page(h, nodes_allowed, 0);
- if (!page)
+ struct folio *folio;
+
+ folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
+ if (!folio)
break;
- list_add(&page->lru, &page_list);
+ list_add(&folio->lru, &page_list);
}
/* free the pages after dropping lock */
spin_unlock_irq(&hugetlb_lock);
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 05/12] hugetlb: restructure pool allocations
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (3 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 04/12] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 06/12] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
` (6 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page. This routine allocates contiguous pages from a
low-level allocator, preps the pages for use as a hugetlb page, and then
adds the resulting hugetlb page to the pool.
In the 'prep' stage, optional vmemmap optimization is done. For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once. To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.
No functional changes, only code restructure.
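Roughly, the restructured allocation path looks like the simplified sketch
below (pseudocode based on the code in this patch; error handling, locking,
and the gigantic-page case are omitted):

	LIST_HEAD(folio_list);

	/* step 1: only allocate; do not prep or add to the pool yet */
	while (more hugetlb pages are needed) {
		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
					nodes_allowed, node_alloc_noretry);
		if (!folio)
			break;
		list_add(&folio->lru, &folio_list);
	}

	/*
	 * step 2: prep each folio and add it to the appropriate pool.
	 * Later patches batch vmemmap optimization for the whole list here.
	 */
	prep_and_add_allocated_folios(h, &folio_list);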
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 155 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 119 insertions(+), 36 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec10e3804060..452e7ed6dc36 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2146,16 +2146,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
return page_folio(page);
}
-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen': ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
bool retry = false;
@@ -2168,6 +2161,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
nid, nmask, node_alloc_noretry);
if (!folio)
return NULL;
+
if (hstate_is_gigantic(h)) {
if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
/*
@@ -2182,32 +2176,80 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
return NULL;
}
}
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));
return folio;
}
+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
+{
+ return __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+}
+
/*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen': ref count of head page and all tail
+ * pages is zero.
*/
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
- int nr_nodes, node;
+
+ folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+ if (!folio)
+ return NULL;
+
+ prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+ return folio;
+}
+
+static void prep_and_add_pool_folio(struct hstate *h, struct folio *folio)
+{
+ prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+
+ /* add it to the appropriate hugepage pool */
+ free_huge_folio(folio);
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+ struct list_head *folio_list)
+{
+ struct folio *folio, *tmp_f;
+
+ list_for_each_entry_safe(folio, tmp_f, folio_list, lru)
+ prep_and_add_pool_folio(h, folio);
+
+ INIT_LIST_HEAD(folio_list);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner. The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+ nodemask_t *nodes_allowed,
+ nodemask_t *node_alloc_noretry)
+{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+ int nr_nodes, node;
for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+ struct folio *folio;
+
+ folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
nodes_allowed, node_alloc_noretry);
- if (folio) {
- free_huge_folio(folio); /* free it into the hugepage allocator */
- return 1;
- }
+ if (folio)
+ return folio;
}
- return 0;
+ return NULL;
}
/*
@@ -3184,8 +3226,7 @@ static void __init gather_bootmem_prealloc(void)
WARN_ON(folio_ref_count(folio) != 1);
if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
WARN_ON(folio_test_reserved(folio));
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));
- free_huge_folio(folio); /* add to the hugepage allocator */
+ prep_and_add_pool_folio(h, folio);
} else {
/* VERY unlikely inflated ref count on a tail page */
free_gigantic_folio(folio, huge_page_order(h));
@@ -3231,9 +3272,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}
+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ * pages are allocated from memblock allocated or something similar.
+ * Gigantic pages are actually added to pools later with the routine
+ * gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ * all of mm is up and functional. Pages are allocated from buddy and
+ * then added to hugetlb pools.
+ */
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
bool node_specific_alloc = false;
@@ -3275,14 +3329,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
for (i = 0; i < h->max_huge_pages; ++i) {
if (hstate_is_gigantic(h)) {
+ /*
+ * gigantic pages not added to list as they are not
+ * added to pools now.
+ */
if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
break;
- } else if (!alloc_pool_huge_page(h,
- &node_states[N_MEMORY],
- node_alloc_noretry))
- break;
+ } else {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ }
cond_resched();
}
+
+ /* list will be empty if hstate_is_gigantic */
+ prep_and_add_allocated_folios(h, &folio_list);
+
if (i < h->max_huge_pages) {
char buf[32];
@@ -3416,7 +3481,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
- unsigned long min_count, ret;
+ unsigned long min_count;
+ unsigned long allocated;
+ struct folio *folio;
LIST_HEAD(page_list);
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
@@ -3491,7 +3558,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
break;
}
- while (count > persistent_huge_pages(h)) {
+ allocated = 0;
+ while (count > (persistent_huge_pages(h) + allocated)) {
/*
* If this allocation races such that we no longer need the
* page, free_huge_folio will handle it by freeing the page
@@ -3502,15 +3570,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
/* yield cpu to avoid soft lockup */
cond_resched();
- ret = alloc_pool_huge_page(h, nodes_allowed,
+ folio = alloc_pool_huge_folio(h, nodes_allowed,
node_alloc_noretry);
- spin_lock_irq(&hugetlb_lock);
- if (!ret)
+ if (!folio) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ list_add(&folio->lru, &page_list);
+ allocated++;
/* Bail for signals. Probably ctrl-c from user */
- if (signal_pending(current))
+ if (signal_pending(current)) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ spin_lock_irq(&hugetlb_lock);
+ }
+
+ /* Add allocated pages to the pool */
+ if (!list_empty(&page_list)) {
+ spin_unlock_irq(&hugetlb_lock);
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
}
/*
@@ -3536,8 +3621,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
- struct folio *folio;
-
folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
if (!folio)
break;
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 06/12] hugetlb: perform vmemmap optimization on a list of pages
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (4 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 05/12] hugetlb: restructure pool allocations Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 07/12] hugetlb: perform vmemmap restoration " Mike Kravetz
` (5 subsequent siblings)
11 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
When adding hugetlb pages to the pool, we first create a list of the
allocated pages before adding to the pool. Pass this list of pages to a
new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
We also modify the routine vmemmap_should_optimize() to check for pages
that are already optimized. There are code paths that might request
vmemmap optimization twice; the check ensures the second request is a
NOOP.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 8 ++++++++
mm/hugetlb_vmemmap.c | 11 +++++++++++
mm/hugetlb_vmemmap.h | 5 +++++
3 files changed, 24 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 452e7ed6dc36..3133dbd89696 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2223,6 +2223,14 @@ static void prep_and_add_allocated_folios(struct hstate *h,
{
struct folio *folio, *tmp_f;
+ /*
+ * Send list for bulk vmemmap optimization processing.
+ * prep_new_hugetlb_folio (called from prep_and_add_pool_folio)
+ * will also attempt to perform vmemmap optimization, but this
+ * is a NOOP if already done.
+ */
+ hugetlb_vmemmap_optimize_folios(h, folio_list);
+
list_for_each_entry_safe(folio, tmp_f, folio_list, lru)
prep_and_add_pool_folio(h, folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4b9734777f69..147018a504a6 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -482,6 +482,9 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
+ if (HPageVmemmapOptimized((struct page *)head))
+ return false;
+
if (!READ_ONCE(vmemmap_optimize_enabled))
return false;
@@ -570,6 +573,14 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
SetHPageVmemmapOptimized(head);
}
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+ struct folio *folio;
+
+ list_for_each_entry(folio, folio_list, lru)
+ hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
static struct ctl_table hugetlb_vmemmap_sysctls[] = {
{
.procname = "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 25bd0e002431..036494e040ca 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -13,6 +13,7 @@
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
/*
* Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
@@ -47,6 +48,10 @@ static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page
{
}
+static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+}
+
static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
return 0;
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 07/12] hugetlb: perform vmemmap restoration on a list of pages
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (5 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 06/12] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-26 6:58 ` kernel test robot
2023-08-30 8:33 ` Muchun Song
2023-08-25 19:04 ` [PATCH 08/12] hugetlb: batch freeing of vmemmap pages Mike Kravetz
` (4 subsequent siblings)
11 siblings, 2 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
When removing hugetlb pages from the pool, we first create a list
of removed pages and then free those pages back to low level allocators.
Part of the 'freeing process' is to restore vmemmap for all base pages
if necessary. Pass this list of pages to a new routine
hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
performed in bulk.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 3 +++
mm/hugetlb_vmemmap.c | 8 ++++++++
mm/hugetlb_vmemmap.h | 6 ++++++
3 files changed, 17 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3133dbd89696..1bde5e234d5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1833,6 +1833,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
struct folio *folio, *t_folio;
+ /* First restore vmemmap for all pages on list. */
+ hugetlb_vmemmap_restore_folios(h, list);
+
list_for_each_entry_safe(folio, t_folio, list, lru) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 147018a504a6..d5e6b6c76dce 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -479,6 +479,14 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}
+void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
+{
+ struct folio *folio;
+
+ list_for_each_entry(folio, folio_list, lru)
+ hugetlb_vmemmap_restore(h, &folio->page);
+}
+
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 036494e040ca..b7074672ceb2 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,6 +12,7 @@
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
@@ -44,6 +45,11 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
return 0;
}
+static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
+{
+
+}
+
static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
}
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 08/12] hugetlb: batch freeing of vmemmap pages
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (6 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 07/12] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-26 4:00 ` kernel test robot
2023-08-30 7:20 ` Muchun Song
2023-08-25 19:04 ` [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag Mike Kravetz
` (3 subsequent siblings)
11 siblings, 2 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
Now that batching of hugetlb vmemmap optimization processing is possible,
batch the freeing of vmemmap pages. When freeing vmemmap pages for a
hugetlb page, we add them to a list that is freed after the entire batch
has been processed.
This enhances the ability to return contiguous ranges of memory to the
low level allocators.
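The resulting bulk path in hugetlb_vmemmap_optimize_folios is roughly as
follows (simplified sketch of the code in this patch):

	LIST_HEAD(vmemmap_pages);	/* vmemmap pages freed from all folios */

	list_for_each_entry(folio, folio_list, lru)
		/* remap this folio's vmemmap; freed pages land on vmemmap_pages */
		hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);

	/* return the accumulated vmemmap pages to the allocator in one go */
	free_vmemmap_page_list(&vmemmap_pages);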
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb_vmemmap.c | 56 ++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 15 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d5e6b6c76dce..e390170c0887 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -305,11 +305,14 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
+ * @bulk_pages: list to deposit vmemmap pages to be freed in bulk operations
+ * or NULL in non-bulk case;
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_free(unsigned long start, unsigned long end,
- unsigned long reuse)
+ unsigned long reuse,
+ struct list_head *bulk_pages)
{
int ret;
LIST_HEAD(vmemmap_pages);
@@ -372,7 +375,14 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
}
mmap_read_unlock(&init_mm);
- free_vmemmap_page_list(&vmemmap_pages);
+ /*
+ * if performing bulk operation, do not free pages here.
+ * rather add them to the bulk list
+ */
+ if (!bulk_pages)
+ free_vmemmap_page_list(&vmemmap_pages);
+ else
+ list_splice(&vmemmap_pages, bulk_pages);
return ret;
}
@@ -546,17 +556,9 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
return true;
}
-/**
- * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
- * @h: struct hstate.
- * @head: the head page whose vmemmap pages will be optimized.
- *
- * This function only tries to optimize @head's vmemmap pages and does not
- * guarantee that the optimization will succeed after it returns. The caller
- * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
- * have been optimized.
- */
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+static void __hugetlb_vmemmap_optimize(const struct hstate *h,
+ struct page *head,
+ struct list_head *bulk_pages)
{
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
unsigned long vmemmap_reuse;
@@ -575,18 +577,42 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
* to the page which @vmemmap_reuse is mapped to, then free the pages
* which the range [@vmemmap_start, @vmemmap_end] is mapped to.
*/
- if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
+ if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
static_branch_dec(&hugetlb_optimize_vmemmap_key);
else
SetHPageVmemmapOptimized(head);
}
+/**
+ * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
+ * @h: struct hstate.
+ * @head: the head page whose vmemmap pages will be optimized.
+ *
+ * This function only tries to optimize @head's vmemmap pages and does not
+ * guarantee that the optimization will succeed after it returns. The caller
+ * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
+ * have been optimized.
+ */
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+{
+ __hugetlb_vmemmap_optimize(h, head, NULL);
+}
+
+void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
+ struct list_head *bulk_pages)
+{
+ __hugetlb_vmemmap_optimize(h, head, bulk_pages);
+}
+
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
+ LIST_HEAD(vmemmap_pages);
list_for_each_entry(folio, folio_list, lru)
- hugetlb_vmemmap_optimize(h, &folio->page);
+ hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
+
+ free_vmemmap_page_list(&vmemmap_pages);
}
static struct ctl_table hugetlb_vmemmap_sysctls[] = {
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (7 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 08/12] hugetlb: batch freeing of vmemmap pages Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-30 7:26 ` Muchun Song
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
` (2 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
At the beginning of hugetlb_vmemmap_optimize, optimistically set
the HPageVmemmapOptimized flag in the head page. Clear the flag
if the operation fails.
No change in behavior. However, this will become important in
subsequent patches where we batch and delay TLB flushing: we need to
make sure the contents of the old and new vmemmap pages are the same.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb_vmemmap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e390170c0887..500a118915ff 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -566,7 +566,9 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
if (!vmemmap_should_optimize(h, head))
return;
+ /* Optimistically assume success */
static_branch_inc(&hugetlb_optimize_vmemmap_key);
+ SetHPageVmemmapOptimized(head);
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
vmemmap_reuse = vmemmap_start;
@@ -577,10 +579,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
* to the page which @vmemmap_reuse is mapped to, then free the pages
* which the range [@vmemmap_start, @vmemmap_end] is mapped to.
*/
- if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
+ if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages)) {
static_branch_dec(&hugetlb_optimize_vmemmap_key);
- else
- SetHPageVmemmapOptimized(head);
+ ClearHPageVmemmapOptimized(head);
+ }
}
/**
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (8 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-26 5:56 ` kernel test robot
` (2 more replies)
2023-08-25 19:04 ` [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
2023-08-25 19:04 ` [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
11 siblings, 3 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
From: Joao Martins <joao.m.martins@oracle.com>
In an effort to minimize the number of TLB flushes, batch all PMD splits
belonging to a range of pages in order to perform only one (global) TLB
flush. This brings a 1T hugetlb allocation down from 14.2secs to 7.9secs.
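With this patch, the bulk optimization path is structured roughly as
follows (simplified sketch; the remap step still flushes per page until a
later patch in the series):

	/* pass 1: split any vmemmap PMDs for all folios, without flushing */
	list_for_each_entry(folio, folio_list, lru)
		hugetlb_vmemmap_split(h, &folio->page);

	/* one global flush covers all of the PMD splits above */
	flush_tlb_kernel_range(0, TLB_FLUSH_ALL);

	/* pass 2: remap/free the vmemmap pages of each folio */
	list_for_each_entry(folio, folio_list, lru)
		hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);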
Rebased by Mike Kravetz
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb_vmemmap.c | 94 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 90 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 500a118915ff..904a64fe5669 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -26,6 +26,7 @@
* @reuse_addr: the virtual address of the @reuse_page page.
* @vmemmap_pages: the list head of the vmemmap pages that can be freed
* or is mapped from.
+ * @flags: used to modify behavior in bulk operations
*/
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -34,9 +35,11 @@ struct vmemmap_remap_walk {
struct page *reuse_page;
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
+#define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
+ unsigned long flags;
};
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool bulk)
{
pmd_t __pmd;
int i;
@@ -79,7 +82,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
pmd_populate_kernel(&init_mm, pmd, pgtable);
- flush_tlb_kernel_range(start, start + PMD_SIZE);
+ if (!bulk)
+ flush_tlb_kernel_range(start, start + PMD_SIZE);
} else {
pte_free_kernel(&init_mm, pgtable);
}
@@ -119,18 +123,28 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
{
+ bool bulk;
pmd_t *pmd;
unsigned long next;
+ bulk = walk->flags & VMEMMAP_REMAP_ONLY_SPLIT;
pmd = pmd_offset(pud, addr);
do {
int ret;
- ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+ ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, bulk);
if (ret)
return ret;
next = pmd_addr_end(addr, end);
+
+ /*
+ * We are only splitting, not remapping the hugetlb vmemmap
+ * pages.
+ */
+ if (bulk)
+ continue;
+
vmemmap_pte_range(pmd, addr, next, walk);
} while (pmd++, addr = next, addr != end);
@@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
return ret;
} while (pgd++, addr = next, addr != end);
- flush_tlb_kernel_range(start, end);
+ if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
+ flush_tlb_kernel_range(start, end);
return 0;
}
@@ -296,6 +311,48 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
}
+/**
+ * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
+ * backing PMDs of the directmap into PTEs
+ * @start: start address of the vmemmap virtual address range that we want
+ * to remap.
+ * @end: end address of the vmemmap virtual address range that we want to
+ * remap.
+ * @reuse: reuse address.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_split(unsigned long start, unsigned long end,
+ unsigned long reuse)
+{
+ int ret;
+ LIST_HEAD(vmemmap_pages);
+ struct vmemmap_remap_walk walk = {
+ .flags = VMEMMAP_REMAP_ONLY_SPLIT,
+ };
+
+ /*
+ * In order to make remapping routine most efficient for the huge pages,
+ * the routine of vmemmap page table walking has the following rules
+ * (see more details from the vmemmap_pte_range()):
+ *
+ * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
+ * should be continuous.
+ * - The @reuse address is part of the range [@reuse, @end) that we are
+ * walking which is passed to vmemmap_remap_range().
+ * - The @reuse address is the first in the complete range.
+ *
+ * So we need to make sure that @start and @reuse meet the above rules.
+ */
+ BUG_ON(start - reuse != PAGE_SIZE);
+
+ mmap_read_lock(&init_mm);
+ ret = vmemmap_remap_range(reuse, end, &walk);
+ mmap_read_unlock(&init_mm);
+
+ return ret;
+}
+
/**
* vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
* to the page which @reuse is mapped to, then free vmemmap
@@ -320,6 +377,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
+ .flags = 0,
};
int nid = page_to_nid((struct page *)start);
gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
@@ -606,11 +664,39 @@ void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
__hugetlb_vmemmap_optimize(h, head, bulk_pages);
}
+void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
+{
+ unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
+ unsigned long vmemmap_reuse;
+
+ if (!vmemmap_should_optimize(h, head))
+ return;
+
+ static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+ vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
+ vmemmap_reuse = vmemmap_start;
+ vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
+
+ /*
+ * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
+ * to the page which @vmemmap_reuse is mapped to, then free the pages
+ * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
+ */
+ if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
+ static_branch_dec(&hugetlb_optimize_vmemmap_key);
+}
+
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
LIST_HEAD(vmemmap_pages);
+ list_for_each_entry(folio, folio_list, lru)
+ hugetlb_vmemmap_split(h, &folio->page);
+
+ flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
+
list_for_each_entry(folio, folio_list, lru)
hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (9 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-30 8:23 ` Muchun Song
2023-08-25 19:04 ` [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
11 siblings, 1 reply; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
From: Joao Martins <joao.m.martins@oracle.com>
Now that a list of pages is deduplicated at once, the TLB
flush can be batched for all vmemmap pages that got remapped.
Add a flags field to the walk and pass down whether this is a bulk
operation or just a single page, so the remap path can decide whether to
flush immediately.
The TLB flush is global as we don't have guarantees from the caller that
the set of folios is contiguous, and composing a list of kVAs to flush
would add complexity.
Modified by Mike Kravetz to perform TLB flush on single folio if an
error is encountered.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb_vmemmap.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 904a64fe5669..a2fc7b03ac6b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -36,6 +36,7 @@ struct vmemmap_remap_walk {
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
#define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
+#define VMEMMAP_REMAP_BULK_PAGES BIT(1)
unsigned long flags;
};
@@ -211,7 +212,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
return ret;
} while (pgd++, addr = next, addr != end);
- if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
+ if (!(walk->flags &
+ (VMEMMAP_REMAP_ONLY_SPLIT | VMEMMAP_REMAP_BULK_PAGES)))
flush_tlb_kernel_range(start, end);
return 0;
@@ -377,7 +379,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
- .flags = 0,
+ .flags = !bulk_pages ? 0 : VMEMMAP_REMAP_BULK_PAGES,
};
int nid = page_to_nid((struct page *)start);
gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
@@ -427,6 +429,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
+ .flags = 0,
};
vmemmap_remap_range(reuse, end, &walk);
@@ -700,6 +703,8 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
list_for_each_entry(folio, folio_list, lru)
hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
+ flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
+
free_vmemmap_page_list(&vmemmap_pages);
}
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
` (10 preceding siblings ...)
2023-08-25 19:04 ` [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
@ 2023-08-25 19:04 ` Mike Kravetz
2023-08-26 8:01 ` kernel test robot
2023-08-30 8:47 ` Muchun Song
11 siblings, 2 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-25 19:04 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Mike Kravetz
Update the hugetlb_vmemmap_restore path to take a 'batch' parameter that
indicates restoration is happening on a batch of pages. When set, use
the existing mechanism (VMEMMAP_REMAP_BULK_PAGES) to delay TLB flushing.
The routine hugetlb_vmemmap_restore_folios is the only user of this new
batch parameter and it will perform a global flush after all vmemmap is
restored.
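The batched restore path then looks roughly like this (simplified sketch
of the code in this patch):

	/* restore vmemmap for every folio on the list, deferring TLB flushes */
	list_for_each_entry(folio, folio_list, lru)
		(void)__hugetlb_vmemmap_restore(h, &folio->page, true);

	/* one global flush for all of the remapping performed above */
	flush_tlb_kernel_range(0, TLB_FLUSH_ALL);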
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a2fc7b03ac6b..d6e7440b9507 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -479,17 +479,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
+ * @bulk: bulk operation, batch TLB flushes
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
- unsigned long reuse)
+ unsigned long reuse, bool bulk)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
+ .flags = !bulk ? 0 : VMEMMAP_REMAP_BULK_PAGES,
};
/* See the comment in the vmemmap_remap_free(). */
@@ -511,17 +513,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
-/**
- * hugetlb_vmemmap_restore - restore previously optimized (by
- * hugetlb_vmemmap_optimize()) vmemmap pages which
- * will be reallocated and remapped.
- * @h: struct hstate.
- * @head: the head page whose vmemmap pages will be restored.
- *
- * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
- * negative error code otherwise.
- */
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, bool bulk)
{
int ret;
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -541,7 +533,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
* When a HugeTLB page is freed to the buddy allocator, previously
* discarded vmemmap pages must be allocated and remapping.
*/
- ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
+ ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk);
if (!ret) {
ClearHPageVmemmapOptimized(head);
static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -550,12 +542,29 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}
+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ * hugetlb_vmemmap_optimize()) vmemmap pages which
+ * will be reallocated and remapped.
+ * @h: struct hstate.
+ * @head: the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
+ */
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+{
+ return __hugetlb_vmemmap_restore(h, head, false);
+}
+
void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
list_for_each_entry(folio, folio_list, lru)
- hugetlb_vmemmap_restore(h, &folio->page);
+ (void)__hugetlb_vmemmap_restore(h, &folio->page, true);
+
+ flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
}
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
--
2.41.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 08/12] hugetlb: batch freeing of vmemmap pages
2023-08-25 19:04 ` [PATCH 08/12] hugetlb: batch freeing of vmemmap pages Mike Kravetz
@ 2023-08-26 4:00 ` kernel test robot
2023-08-30 7:20 ` Muchun Song
1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-08-26 4:00 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, Linux Memory Management List,
Mike Kravetz
Hi Mike,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20230825]
[cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
base: next-20230825
patch link: https://lore.kernel.org/r/20230825190436.55045-9-mike.kravetz%40oracle.com
patch subject: [PATCH 08/12] hugetlb: batch freeing of vmemmap pages
config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261129.hdlNvI3X-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261129.hdlNvI3X-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308261129.hdlNvI3X-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/hugetlb_vmemmap.c:601:6: warning: no previous prototype for function 'hugetlb_vmemmap_optimize_bulk' [-Wmissing-prototypes]
601 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
mm/hugetlb_vmemmap.c:601:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
601 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
| static
1 warning generated.
vim +/hugetlb_vmemmap_optimize_bulk +601 mm/hugetlb_vmemmap.c
600
> 601 void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
602 struct list_head *bulk_pages)
603 {
604 __hugetlb_vmemmap_optimize(h, head, bulk_pages);
605 }
606
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
@ 2023-08-26 5:56 ` kernel test robot
2023-08-28 9:42 ` Joao Martins
2023-08-26 18:14 ` kernel test robot
2023-08-30 8:09 ` Muchun Song
2 siblings, 1 reply; 36+ messages in thread
From: kernel test robot @ 2023-08-26 5:56 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, Linux Memory Management List,
Mike Kravetz
Hi Mike,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20230825]
[cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
base: next-20230825
patch link: https://lore.kernel.org/r/20230825190436.55045-11-mike.kravetz%40oracle.com
patch subject: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308261325.ipTttZHZ-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
mm/hugetlb_vmemmap.c:661:6: warning: no previous prototype for function 'hugetlb_vmemmap_optimize_bulk' [-Wmissing-prototypes]
661 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
mm/hugetlb_vmemmap.c:661:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
661 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
| static
>> mm/hugetlb_vmemmap.c:667:6: warning: no previous prototype for function 'hugetlb_vmemmap_split' [-Wmissing-prototypes]
667 | void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
| ^
mm/hugetlb_vmemmap.c:667:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
667 | void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
| ^
| static
>> mm/hugetlb_vmemmap.c:698:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
698 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
| ^
2 warnings and 1 error generated.
vim +/TLB_FLUSH_ALL +698 mm/hugetlb_vmemmap.c
666
> 667 void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
668 {
669 unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
670 unsigned long vmemmap_reuse;
671
672 if (!vmemmap_should_optimize(h, head))
673 return;
674
675 static_branch_inc(&hugetlb_optimize_vmemmap_key);
676
677 vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
678 vmemmap_reuse = vmemmap_start;
679 vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
680
681 /*
682 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
683 * to the page which @vmemmap_reuse is mapped to, then free the pages
684 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
685 */
686 if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
687 static_branch_dec(&hugetlb_optimize_vmemmap_key);
688 }
689
690 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
691 {
692 struct folio *folio;
693 LIST_HEAD(vmemmap_pages);
694
695 list_for_each_entry(folio, folio_list, lru)
696 hugetlb_vmemmap_split(h, &folio->page);
697
> 698 flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
699
700 list_for_each_entry(folio, folio_list, lru)
701 hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
702
703 free_vmemmap_page_list(&vmemmap_pages);
704 }
705
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 07/12] hugetlb: perform vmemmap restoration on a list of pages
2023-08-25 19:04 ` [PATCH 07/12] hugetlb: perform vmemmap restoration " Mike Kravetz
@ 2023-08-26 6:58 ` kernel test robot
2023-08-30 8:33 ` Muchun Song
1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-08-26 6:58 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, Linux Memory Management List,
Mike Kravetz
Hi Mike,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20230825]
[cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
base: next-20230825
patch link: https://lore.kernel.org/r/20230825190436.55045-8-mike.kravetz%40oracle.com
patch subject: [PATCH 07/12] hugetlb: perform vmemmap restoration on a list of pages
config: arm64-randconfig-r011-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261407.9M957lhF-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261407.9M957lhF-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308261407.9M957lhF-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/hugetlb.c:49:
>> mm/hugetlb_vmemmap.h:50:2: error: void function 'hugetlb_vmemmap_restore_folios' should not return a value [-Wreturn-type]
50 | return 0;
| ^ ~
1 error generated.
vim +/hugetlb_vmemmap_restore_folios +50 mm/hugetlb_vmemmap.h
47
48 static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
49 {
> 50 return 0;
51 }
52
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap
2023-08-25 19:04 ` [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
@ 2023-08-26 8:01 ` kernel test robot
2023-08-30 8:47 ` Muchun Song
1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-08-26 8:01 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, Linux Memory Management List,
Mike Kravetz
Hi Mike,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20230825]
[cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
base: next-20230825
patch link: https://lore.kernel.org/r/20230825190436.55045-13-mike.kravetz%40oracle.com
patch subject: [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap
config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261516.F6FBNktd-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261516.F6FBNktd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308261516.F6FBNktd-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/hugetlb_vmemmap.c:516:5: warning: no previous prototype for function '__hugetlb_vmemmap_restore' [-Wmissing-prototypes]
516 | int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, bool bulk)
| ^
mm/hugetlb_vmemmap.c:516:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
516 | int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, bool bulk)
| ^
| static
mm/hugetlb_vmemmap.c:567:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
567 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
| ^
mm/hugetlb_vmemmap.c:673:6: warning: no previous prototype for function 'hugetlb_vmemmap_optimize_bulk' [-Wmissing-prototypes]
673 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
mm/hugetlb_vmemmap.c:673:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
673 | void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
| ^
| static
mm/hugetlb_vmemmap.c:679:6: warning: no previous prototype for function 'hugetlb_vmemmap_split' [-Wmissing-prototypes]
679 | void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
| ^
mm/hugetlb_vmemmap.c:679:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
679 | void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
| ^
| static
mm/hugetlb_vmemmap.c:710:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
710 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
| ^
mm/hugetlb_vmemmap.c:715:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
715 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
| ^
3 warnings and 3 errors generated.
vim +/__hugetlb_vmemmap_restore +516 mm/hugetlb_vmemmap.c
515
> 516 int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, bool bulk)
517 {
518 int ret;
519 unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
520 unsigned long vmemmap_reuse;
521
522 if (!HPageVmemmapOptimized(head))
523 return 0;
524
525 vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
526 vmemmap_reuse = vmemmap_start;
527 vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
528
529 /*
530 * The pages which the vmemmap virtual address range [@vmemmap_start,
531 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
532 * the range is mapped to the page which @vmemmap_reuse is mapped to.
533 * When a HugeTLB page is freed to the buddy allocator, previously
534 * discarded vmemmap pages must be allocated and remapping.
535 */
536 ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk);
537 if (!ret) {
538 ClearHPageVmemmapOptimized(head);
539 static_branch_dec(&hugetlb_optimize_vmemmap_key);
540 }
541
542 return ret;
543 }
544
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
2023-08-26 5:56 ` kernel test robot
@ 2023-08-26 18:14 ` kernel test robot
2023-08-30 8:09 ` Muchun Song
2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-08-26 18:14 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: oe-kbuild-all, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, Linux Memory Management List,
Mike Kravetz
Hi Mike,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20230825]
[cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
base: next-20230825
patch link: https://lore.kernel.org/r/20230825190436.55045-11-mike.kravetz%40oracle.com
patch subject: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
config: x86_64-randconfig-r031-20230826 (https://download.01.org/0day-ci/archive/20230827/202308270248.eJT8UDG3-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230827/202308270248.eJT8UDG3-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308270248.eJT8UDG3-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/hugetlb_vmemmap.c:40: warning: Function parameter or member 'flags' not described in 'vmemmap_remap_walk'
vim +40 mm/hugetlb_vmemmap.c
f41f2ed43ca525 Muchun Song 2021-06-30 19
998a2997885f73 Muchun Song 2022-06-28 20 /**
998a2997885f73 Muchun Song 2022-06-28 21 * struct vmemmap_remap_walk - walk vmemmap page table
998a2997885f73 Muchun Song 2022-06-28 22 *
998a2997885f73 Muchun Song 2022-06-28 23 * @remap_pte: called for each lowest-level entry (PTE).
998a2997885f73 Muchun Song 2022-06-28 24 * @nr_walked: the number of walked pte.
998a2997885f73 Muchun Song 2022-06-28 25 * @reuse_page: the page which is reused for the tail vmemmap pages.
998a2997885f73 Muchun Song 2022-06-28 26 * @reuse_addr: the virtual address of the @reuse_page page.
998a2997885f73 Muchun Song 2022-06-28 27 * @vmemmap_pages: the list head of the vmemmap pages that can be freed
998a2997885f73 Muchun Song 2022-06-28 28 * or is mapped from.
506a27a4627ab7 Joao Martins 2023-08-25 29 * @flags used to modify behavior in bulk operations
998a2997885f73 Muchun Song 2022-06-28 30 */
998a2997885f73 Muchun Song 2022-06-28 31 struct vmemmap_remap_walk {
998a2997885f73 Muchun Song 2022-06-28 32 void (*remap_pte)(pte_t *pte, unsigned long addr,
998a2997885f73 Muchun Song 2022-06-28 33 struct vmemmap_remap_walk *walk);
998a2997885f73 Muchun Song 2022-06-28 34 unsigned long nr_walked;
998a2997885f73 Muchun Song 2022-06-28 35 struct page *reuse_page;
998a2997885f73 Muchun Song 2022-06-28 36 unsigned long reuse_addr;
998a2997885f73 Muchun Song 2022-06-28 37 struct list_head *vmemmap_pages;
506a27a4627ab7 Joao Martins 2023-08-25 38 #define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
506a27a4627ab7 Joao Martins 2023-08-25 39 unsigned long flags;
998a2997885f73 Muchun Song 2022-06-28 @40 };
998a2997885f73 Muchun Song 2022-06-28 41
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-26 5:56 ` kernel test robot
@ 2023-08-28 9:42 ` Joao Martins
2023-08-28 16:44 ` Mike Kravetz
0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2023-08-28 9:42 UTC (permalink / raw)
To: kernel test robot, Mike Kravetz, linux-mm
Cc: llvm, oe-kbuild-all, Muchun Song, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton, linux-kernel
On 26/08/2023 06:56, kernel test robot wrote:
> Hi Mike,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on next-20230825]
> [cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
> base: next-20230825
> patch link: https://lore.kernel.org/r/20230825190436.55045-11-mike.kravetz%40oracle.com
> patch subject: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
> config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/config)
> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
> reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202308261325.ipTttZHZ-lkp@intel.com/
>
> All error/warnings (new ones prefixed by >>):
>
[...]
>>> mm/hugetlb_vmemmap.c:698:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
> 698 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> | ^
> 2 warnings and 1 error generated.
>
>
TLB_FLUSH_ALL is x86-only, so what I wrote above is wrong for code that should be
architecture independent. The correct way to write the global TLB flush is to use
flush_tlb_all(), which every arch implements.
The alternative is to compose a start/end tuple in the top-level optimize-folios
function as we iterate over the folios to remap, and flush via
flush_tlb_kernel_range(). But this would likely only be relevant on x86, namely
to optimize the flushing of 3 contiguous 2M hugetlb pages (~24 vmemmap pages),
since that is where the TLB flush ceiling (31 pages) for per-page VA flushes sits
before falling back to a global TLB flush. I wasn't sure the added complexity was
worth the dubious benefit, so I kept the global TLB flush.
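As a rough sketch, the arch-independent variant would be the same function as in
the hunk quoted below, with only the flush call changed (flush_tlb_all() is the
generic interface each arch provides):

        void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
        {
                struct folio *folio;
                LIST_HEAD(vmemmap_pages);

                /* Pass 1: split all PMD-mapped vmemmap for the batch. */
                list_for_each_entry(folio, folio_list, lru)
                        hugetlb_vmemmap_split(h, &folio->page);

                /* One global flush covers every split performed above. */
                flush_tlb_all();

                /* Pass 2: remap and collect the now-unused vmemmap pages. */
                list_for_each_entry(folio, folio_list, lru)
                        hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);

                free_vmemmap_page_list(&vmemmap_pages);
        }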
> vim +/TLB_FLUSH_ALL +698 mm/hugetlb_vmemmap.c
>
> 666
> > 667 void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> 668 {
> 669 unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> 670 unsigned long vmemmap_reuse;
> 671
> 672 if (!vmemmap_should_optimize(h, head))
> 673 return;
> 674
> 675 static_branch_inc(&hugetlb_optimize_vmemmap_key);
> 676
> 677 vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> 678 vmemmap_reuse = vmemmap_start;
> 679 vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> 680
> 681 /*
> 682 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> 683 * to the page which @vmemmap_reuse is mapped to, then free the pages
> 684 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> 685 */
> 686 if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
> 687 static_branch_dec(&hugetlb_optimize_vmemmap_key);
> 688 }
> 689
> 690 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> 691 {
> 692 struct folio *folio;
> 693 LIST_HEAD(vmemmap_pages);
> 694
> 695 list_for_each_entry(folio, folio_list, lru)
> 696 hugetlb_vmemmap_split(h, &folio->page);
> 697
> > 698 flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> 699
> 700 list_for_each_entry(folio, folio_list, lru)
> 701 hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
> 702
> 703 free_vmemmap_page_list(&vmemmap_pages);
> 704 }
> 705
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-28 9:42 ` Joao Martins
@ 2023-08-28 16:44 ` Mike Kravetz
2023-08-29 3:47 ` Muchun Song
0 siblings, 1 reply; 36+ messages in thread
From: Mike Kravetz @ 2023-08-28 16:44 UTC (permalink / raw)
To: Joao Martins
Cc: kernel test robot, linux-mm, llvm, oe-kbuild-all, Muchun Song,
Oscar Salvador, David Hildenbrand, Miaohe Lin, David Rientjes,
Anshuman Khandual, Naoya Horiguchi, Barry Song, Michal Hocko,
Matthew Wilcox, Xiongchun Duan, Andrew Morton, linux-kernel
On 08/28/23 10:42, Joao Martins wrote:
> On 26/08/2023 06:56, kernel test robot wrote:
> > Hi Mike,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on next-20230825]
> > [cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
> > base: next-20230825
> > patch link: https://lore.kernel.org/r/20230825190436.55045-11-mike.kravetz%40oracle.com
> > patch subject: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
> > config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/config)
> > compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
> > reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202308261325.ipTttZHZ-lkp@intel.com/
> >
> > All error/warnings (new ones prefixed by >>):
> >
>
> [...]
>
> >>> mm/hugetlb_vmemmap.c:698:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
> > 698 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> > | ^
> > 2 warnings and 1 error generated.
> >
> >
>
> TLB_FLUSH_ALL is x86 only so what I wrote above is wrong in what should be
> architecture independent. The way I should have written the global TLB flush is
> to use flush_tlb_all(), which is what is implemented by the arch.
>
> The alternative is to compose a start/end tuple in the top-level optimize-folios
> function as we iterate over folios to remap, and flush via
> flush_tlb_kernel_range(). But this would likely only be relevant on x86 only,
> that is to optimize the flushing of 3 contiguous 2M hugetlb pages (~24 vmemmap
> pages) as that's where the TLB flush ceiling is put (31 pages) for per-page VA
> flush, before falling back to a global TLB flush. Weren't sure of the added
> complexity for dubious benefit thus kept it in global TLB flush.
Thanks Joao.
I added my share of build issues to this RFC as can be seen in the bot
responses to other patches.
My assumption is that these build issues will not prevent people from
looking into and commenting on the bigger performance issue that was the
reason for this series. The build issues would of course be resolved if
there is some consensus that this is the way to move forward to address
this issue. If the build issues are a stumbling block for anyone to
look at this bigger issue, let me know and I will fix them all ASAP.
--
Mike Kravetz
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-28 16:44 ` Mike Kravetz
@ 2023-08-29 3:47 ` Muchun Song
0 siblings, 0 replies; 36+ messages in thread
From: Muchun Song @ 2023-08-29 3:47 UTC (permalink / raw)
To: Mike Kravetz
Cc: Joao Martins, kernel test robot, Linux-MM, llvm, oe-kbuild-all,
Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton, LKML
> On Aug 29, 2023, at 00:44, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 08/28/23 10:42, Joao Martins wrote:
>> On 26/08/2023 06:56, kernel test robot wrote:
>>> Hi Mike,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on next-20230825]
>>> [cannot apply to akpm-mm/mm-everything v6.5-rc7 v6.5-rc6 v6.5-rc5 linus/master v6.5-rc7]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url: https://github.com/intel-lab-lkp/linux/commits/Mike-Kravetz/hugetlb-clear-flags-in-tail-pages-that-will-be-freed-individually/20230826-030805
>>> base: next-20230825
>>> patch link: https://lore.kernel.org/r/20230825190436.55045-11-mike.kravetz%40oracle.com
>>> patch subject: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
>>> config: s390-randconfig-001-20230826 (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/config)
>>> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
>>> reproduce: (https://download.01.org/0day-ci/archive/20230826/202308261325.ipTttZHZ-lkp@intel.com/reproduce)
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <lkp@intel.com>
>>> | Closes: https://lore.kernel.org/oe-kbuild-all/202308261325.ipTttZHZ-lkp@intel.com/
>>>
>>> All error/warnings (new ones prefixed by >>):
>>>
>>
>> [...]
>>
>>>>> mm/hugetlb_vmemmap.c:698:28: error: use of undeclared identifier 'TLB_FLUSH_ALL'
>>> 698 | flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
>>> | ^
>>> 2 warnings and 1 error generated.
>>>
>>>
>>
>> TLB_FLUSH_ALL is x86 only so what I wrote above is wrong in what should be
>> architecture independent. The way I should have written the global TLB flush is
>> to use flush_tlb_all(), which is what is implemented by the arch.
>>
>> The alternative is to compose a start/end tuple in the top-level optimize-folios
>> function as we iterate over folios to remap, and flush via
>> flush_tlb_kernel_range(). But this would likely only be relevant on x86 only,
>> that is to optimize the flushing of 3 contiguous 2M hugetlb pages (~24 vmemmap
>> pages) as that's where the TLB flush ceiling is put (31 pages) for per-page VA
>> flush, before falling back to a global TLB flush. Weren't sure of the added
>> complexity for dubious benefit thus kept it in global TLB flush.
>
> Thanks Joao.
>
> I added my share of build issues to this RFC as can be seen in the bot
> responses to other patches.
>
> My assumption is that these build issues will not prevent people from
> looking into and commenting on the bigger performance issue that was the
> reason for this series. The build issues would of course be resolved if
> there is some concensus that this is the way to move forward to address
> this issue. If the build issues are a stumbling block for anyone to
> look at this bigger issue, let me know and I will fix them all ASAP.
No need to update. But I need some time to look.
Muchun,
Thanks.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 08/12] hugetlb: batch freeing of vmemmap pages
2023-08-25 19:04 ` [PATCH 08/12] hugetlb: batch freeing of vmemmap pages Mike Kravetz
2023-08-26 4:00 ` kernel test robot
@ 2023-08-30 7:20 ` Muchun Song
2023-08-30 18:36 ` Mike Kravetz
1 sibling, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-30 7:20 UTC (permalink / raw)
To: Mike Kravetz
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, linux-kernel, linux-mm
On 2023/8/26 03:04, Mike Kravetz wrote:
> Now that batching of hugetlb vmemmap optimization processing is possible,
> batch the freeing of vmemmap pages. When freeing vmemmap pages for a
> hugetlb page, we add them to a list that is freed after the entire batch
> has been processed.
>
> This enhances the ability to return contiguous ranges of memory to the
> low level allocators.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 56 ++++++++++++++++++++++++++++++++------------
> 1 file changed, 41 insertions(+), 15 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index d5e6b6c76dce..e390170c0887 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -305,11 +305,14 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> * @end: end address of the vmemmap virtual address range that we want to
> * remap.
> * @reuse: reuse address.
> + * @bulk_pages: list to deposit vmemmap pages to be freed in bulk operations
> + * or NULL in non-bulk case;
I'd like to rename bulk_pages to vmemmap_pages. Always add the vmemmap
pages to this list and let the callers (hugetlb_vmemmap_optimize and
hugetlb_vmemmap_optimize_folios) free them. That would be clearer to me.
> *
> * Return: %0 on success, negative error code otherwise.
> */
> static int vmemmap_remap_free(unsigned long start, unsigned long end,
> - unsigned long reuse)
> + unsigned long reuse,
> + struct list_head *bulk_pages)
> {
> int ret;
> LIST_HEAD(vmemmap_pages);
> @@ -372,7 +375,14 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> }
> mmap_read_unlock(&init_mm);
>
> - free_vmemmap_page_list(&vmemmap_pages);
> + /*
> + * if performing bulk operation, do not free pages here.
> + * rather add them to the bulk list
> + */
> + if (!bulk_pages)
> + free_vmemmap_page_list(&vmemmap_pages);
> + else
> + list_splice(&vmemmap_pages, bulk_pages);
Here, always add vmemmap_pages to the list.
>
> return ret;
> }
> @@ -546,17 +556,9 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> return true;
> }
>
> -/**
> - * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> - * @h: struct hstate.
> - * @head: the head page whose vmemmap pages will be optimized.
> - *
> - * This function only tries to optimize @head's vmemmap pages and does not
> - * guarantee that the optimization will succeed after it returns. The caller
> - * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> - * have been optimized.
> - */
> -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> + struct page *head,
> + struct list_head *bulk_pages)
Also struct list_head *vmemmap_pages.
> {
> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> unsigned long vmemmap_reuse;
> @@ -575,18 +577,42 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> * to the page which @vmemmap_reuse is mapped to, then free the pages
> * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> */
> - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
> + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
> static_branch_dec(&hugetlb_optimize_vmemmap_key);
> else
> SetHPageVmemmapOptimized(head);
> }
>
> +/**
> + * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> + * @h: struct hstate.
> + * @head: the head page whose vmemmap pages will be optimized.
> + *
> + * This function only tries to optimize @head's vmemmap pages and does not
> + * guarantee that the optimization will succeed after it returns. The caller
> + * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> + * have been optimized.
> + */
> +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +{
> + __hugetlb_vmemmap_optimize(h, head, NULL);
Use free_vmemmap_page_list to free vmemmap pages here.
> +}
> +
> +void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
> + struct list_head *bulk_pages)
> +{
> + __hugetlb_vmemmap_optimize(h, head, bulk_pages);
> +}
> +
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
> + LIST_HEAD(vmemmap_pages);
>
> list_for_each_entry(folio, folio_list, lru)
> - hugetlb_vmemmap_optimize(h, &folio->page);
> + hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
Directly use __hugetlb_vmemmap_optimize and delete
hugetlb_vmemmap_optimize_bulk.
In the future, we could rename hugetlb_vmemmap_optimize to
hugetlb_vmemmap_optimize_folio; then both function names would be
consistent, e.g.
1) hugetlb_vmemmap_optimize_folio(): used to free one folio's vmemmap
pages.
2) hugetlb_vmemmap_optimize_folios(): used to free multiple folios'
vmemmap pages.
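A rough sketch of what that would look like (names taken from the suggestions
above, so treat them as illustrative only):

        /* Free one folio's vmemmap pages. */
        void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
        {
                LIST_HEAD(vmemmap_pages);

                __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
                free_vmemmap_page_list(&vmemmap_pages);
        }

        /* Free multiple folios' vmemmap pages with one batched free. */
        void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
        {
                struct folio *folio;
                LIST_HEAD(vmemmap_pages);

                list_for_each_entry(folio, folio_list, lru)
                        __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);

                free_vmemmap_page_list(&vmemmap_pages);
        }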
Thanks.
> +
> + free_vmemmap_page_list(&vmemmap_pages);
> }
>
> static struct ctl_table hugetlb_vmemmap_sysctls[] = {
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag
2023-08-25 19:04 ` [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag Mike Kravetz
@ 2023-08-30 7:26 ` Muchun Song
2023-08-30 22:47 ` Mike Kravetz
0 siblings, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-30 7:26 UTC (permalink / raw)
To: Mike Kravetz, linux-mm, linux-kernel
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton
On 2023/8/26 03:04, Mike Kravetz wrote:
> At the beginning of hugetlb_vmemmap_optimize, optimistically set
> the HPageVmemmapOptimized flag in the head page. Clear the flag
> if the operation fails.
>
> No change in behavior. However, this will become important in
> subsequent patches where we batch delay TLB flushing. We need to
> make sure the content in the old and new vmemmap pages are the same.
Sorry, I didn't get the point here. Could you elaborate?
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index e390170c0887..500a118915ff 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -566,7 +566,9 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> if (!vmemmap_should_optimize(h, head))
> return;
>
> + /* Optimistically assume success */
> static_branch_inc(&hugetlb_optimize_vmemmap_key);
> + SetHPageVmemmapOptimized(head);
>
> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> vmemmap_reuse = vmemmap_start;
> @@ -577,10 +579,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> * to the page which @vmemmap_reuse is mapped to, then free the pages
> * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> */
> - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
> + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages)) {
> static_branch_dec(&hugetlb_optimize_vmemmap_key);
> - else
> - SetHPageVmemmapOptimized(head);
> + ClearHPageVmemmapOptimized(head);
> + }
> }
>
> /**
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
2023-08-26 5:56 ` kernel test robot
2023-08-26 18:14 ` kernel test robot
@ 2023-08-30 8:09 ` Muchun Song
2023-08-30 11:13 ` Joao Martins
2 siblings, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-30 8:09 UTC (permalink / raw)
To: Mike Kravetz, Joao Martins
Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
linux-mm, linux-kernel
On 2023/8/26 03:04, Mike Kravetz wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
>
> In an effort to minimize amount of TLB flushes, batch all PMD splits
> belonging to a range of pages in order to perform only 1 (global) TLB
> flush. This brings down from 14.2secs into 7.9secs a 1T hugetlb
> allocation.
>
> Rebased by Mike Kravetz
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 94 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 90 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 500a118915ff..904a64fe5669 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -26,6 +26,7 @@
> * @reuse_addr: the virtual address of the @reuse_page page.
> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> * or is mapped from.
> + * @flags used to modify behavior in bulk operations
> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -34,9 +35,11 @@ struct vmemmap_remap_walk {
> struct page *reuse_page;
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> +#define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
> + unsigned long flags;
> };
>
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool bulk)
> {
> pmd_t __pmd;
> int i;
> @@ -79,7 +82,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> /* Make pte visible before pmd. See comment in pmd_install(). */
> smp_wmb();
> pmd_populate_kernel(&init_mm, pmd, pgtable);
> - flush_tlb_kernel_range(start, start + PMD_SIZE);
> + if (!bulk)
> + flush_tlb_kernel_range(start, start + PMD_SIZE);
This reads a little oddly to me. @bulk is used to indicate whether the TLB
should be flushed; however, the flag name is VMEMMAP_REMAP_ONLY_SPLIT,
which suggests that @bulk (calculated from
walk->flags & VMEMMAP_REMAP_ONLY_SPLIT)
is an indicator to only split the huge pmd entry. I think it is better
to introduce another flag, like VMEMMAP_SPLIT_WITHOUT_FLUSH, to indicate
whether the TLB should be flushed.
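Something along these lines (just a naming sketch; the local variable name
no_flush is only illustrative):

        struct vmemmap_remap_walk {
                /* ... other members unchanged ... */
        #define VMEMMAP_SPLIT_WITHOUT_FLUSH     BIT(0)
                unsigned long flags;
        };

        /* in vmemmap_pmd_range(): the bool now plainly means "skip the TLB flush" */
        bool no_flush = walk->flags & VMEMMAP_SPLIT_WITHOUT_FLUSH;

        ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, no_flush);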
> } else {
> pte_free_kernel(&init_mm, pgtable);
> }
> @@ -119,18 +123,28 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> unsigned long end,
> struct vmemmap_remap_walk *walk)
> {
> + bool bulk;
> pmd_t *pmd;
> unsigned long next;
>
> + bulk = walk->flags & VMEMMAP_REMAP_ONLY_SPLIT;
> pmd = pmd_offset(pud, addr);
> do {
> int ret;
>
> - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, bulk);
> if (ret)
> return ret;
>
> next = pmd_addr_end(addr, end);
> +
> + /*
> + * We are only splitting, not remapping the hugetlb vmemmap
> + * pages.
> + */
> + if (bulk)
> + continue;
Actually, we do not need a flag to detect this situation; you could
use "!@walk->remap_pte" to determine whether we should go into the
next-level traversal of the page table. ->remap_pte is used to traverse
the pte entries, so it makes sense to continue to the next pmd entry if
it is NULL.
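I.e., a sketch of the check in vmemmap_pmd_range():

        /* No ->remap_pte means this is a split-only walk; skip the pte level. */
        if (!walk->remap_pte)
                continue;

        vmemmap_pte_range(pmd, addr, next, walk);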
> +
> vmemmap_pte_range(pmd, addr, next, walk);
> } while (pmd++, addr = next, addr != end);
>
> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> return ret;
> } while (pgd++, addr = next, addr != end);
>
> - flush_tlb_kernel_range(start, end);
> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
> + flush_tlb_kernel_range(start, end);
This could be:
if (walk->remap_pte)
flush_tlb_kernel_range(start, end);
>
> return 0;
> }
> @@ -296,6 +311,48 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> }
>
> +/**
> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> + * backing PMDs of the directmap into PTEs
> + * @start: start address of the vmemmap virtual address range that we want
> + * to remap.
> + * @end: end address of the vmemmap virtual address range that we want to
> + * remap.
> + * @reuse: reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + int ret;
> + LIST_HEAD(vmemmap_pages);
Unused variable?
> + struct vmemmap_remap_walk walk = {
> + .flags = VMEMMAP_REMAP_ONLY_SPLIT,
> + };
> +
> + /*
> + * In order to make remapping routine most efficient for the huge pages,
> + * the routine of vmemmap page table walking has the following rules
> + * (see more details from the vmemmap_pte_range()):
> + *
> + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> + * should be continuous.
> + * - The @reuse address is part of the range [@reuse, @end) that we are
> + * walking which is passed to vmemmap_remap_range().
> + * - The @reuse address is the first in the complete range.
> + *
> + * So we need to make sure that @start and @reuse meet the above rules.
> + */
The comments are duplicated, something like:
/* See the comment in the vmemmap_remap_free(). */
is enough.
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + mmap_read_lock(&init_mm);
> + ret = vmemmap_remap_range(reuse, end, &walk);
> + mmap_read_unlock(&init_mm);
> +
> + return ret;
> +}
> +
> /**
> * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> * to the page which @reuse is mapped to, then free vmemmap
> @@ -320,6 +377,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_remap_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> + .flags = 0,
> };
> int nid = page_to_nid((struct page *)start);
> gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -606,11 +664,39 @@ void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
> __hugetlb_vmemmap_optimize(h, head, bulk_pages);
> }
>
> +void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> + unsigned long vmemmap_reuse;
> +
> + if (!vmemmap_should_optimize(h, head))
> + return;
> +
> + static_branch_inc(&hugetlb_optimize_vmemmap_key);
Why? hugetlb_optimize_vmemmap_key is used as a switch to let
page_fixed_fake_head work properly for vmemmap-optimized
HugeTLB pages; however, this function only splits the huge pmd
entry without optimizing the vmemmap pages. So it is wrong to
increase the static_key here.
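For clarity, a sketch of the split helper without touching the static_key
(otherwise identical to the quoted code):

        void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
        {
                unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
                unsigned long vmemmap_reuse;

                if (!vmemmap_should_optimize(h, head))
                        return;

                vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
                vmemmap_reuse = vmemmap_start;
                vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;

                /* Only split the huge PMDs here; the remap/optimize pass comes later. */
                vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
        }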
Thanks.
> +
> + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> + vmemmap_reuse = vmemmap_start;
> + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> +
> + /*
> + * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> + * to the page which @vmemmap_reuse is mapped to, then free the pages
> + * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> + */
> + if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
> + static_branch_dec(&hugetlb_optimize_vmemmap_key);
> +}
> +
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
> LIST_HEAD(vmemmap_pages);
>
> + list_for_each_entry(folio, folio_list, lru)
> + hugetlb_vmemmap_split(h, &folio->page);
> +
> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> +
> list_for_each_entry(folio, folio_list, lru)
> hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap
2023-08-25 19:04 ` [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
@ 2023-08-30 8:23 ` Muchun Song
2023-08-30 11:17 ` Joao Martins
0 siblings, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-30 8:23 UTC (permalink / raw)
To: Mike Kravetz, Joao Martins
Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
linux-mm, linux-kernel
On 2023/8/26 03:04, Mike Kravetz wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
>
> Now that a list of pages is deduplicated at once, the TLB
> flush can be batched for all vmemmap pages that got remapped.
>
> Add a flags field and pass whether it's a bulk allocation or
> just a single page to decide to remap.
>
> The TLB flush is global as we don't have guarantees from caller
> that the set of folios is contiguous, or to add complexity in
> composing a list of kVAs to flush.
>
> Modified by Mike Kravetz to perform TLB flush on single folio if an
> error is encountered.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 904a64fe5669..a2fc7b03ac6b 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -36,6 +36,7 @@ struct vmemmap_remap_walk {
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> #define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
> +#define VMEMMAP_REMAP_BULK_PAGES BIT(1)
We could reuse the flag (VMEMMAP_SPLIT_WITHOUT_FLUSH, as I suggested
for patch 10) here. Having seen this patch, though, I think that name
is not suitable; maybe VMEMMAP_WITHOUT_TLB_FLUSH is better.
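Roughly (a naming sketch only; the check itself stays as in the patch):

        #define VMEMMAP_WITHOUT_TLB_FLUSH       BIT(0)

        /* in vmemmap_remap_range() */
        if (!(walk->flags & VMEMMAP_WITHOUT_TLB_FLUSH))
                flush_tlb_kernel_range(start, end);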
Thanks.
> unsigned long flags;
> };
>
> @@ -211,7 +212,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> return ret;
> } while (pgd++, addr = next, addr != end);
>
> - if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
> + if (!(walk->flags &
> + (VMEMMAP_REMAP_ONLY_SPLIT | VMEMMAP_REMAP_BULK_PAGES)))
> flush_tlb_kernel_range(start, end);
>
> return 0;
> @@ -377,7 +379,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_remap_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> - .flags = 0,
> + .flags = !bulk_pages ? 0 : VMEMMAP_REMAP_BULK_PAGES,
> };
> int nid = page_to_nid((struct page *)start);
> gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -427,6 +429,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> + .flags = 0,
> };
>
> vmemmap_remap_range(reuse, end, &walk);
> @@ -700,6 +703,8 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
> list_for_each_entry(folio, folio_list, lru)
> hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>
> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> +
> free_vmemmap_page_list(&vmemmap_pages);
> }
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 07/12] hugetlb: perform vmemmap restoration on a list of pages
2023-08-25 19:04 ` [PATCH 07/12] hugetlb: perform vmemmap restoration " Mike Kravetz
2023-08-26 6:58 ` kernel test robot
@ 2023-08-30 8:33 ` Muchun Song
2023-08-30 17:53 ` Mike Kravetz
1 sibling, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-30 8:33 UTC (permalink / raw)
To: Mike Kravetz
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, linux-mm, linux-kernel
On 2023/8/26 03:04, Mike Kravetz wrote:
> When removing hugetlb pages from the pool, we first create a list
> of removed pages and then free those pages back to low level allocators.
> Part of the 'freeing process' is to restore vmemmap for all base pages
> if necessary. Pass this list of pages to a new routine
> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> performed in bulk.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb.c | 3 +++
> mm/hugetlb_vmemmap.c | 8 ++++++++
> mm/hugetlb_vmemmap.h | 6 ++++++
> 3 files changed, 17 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3133dbd89696..1bde5e234d5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1833,6 +1833,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> {
> struct folio *folio, *t_folio;
>
> + /* First restore vmemmap for all pages on list. */
> + hugetlb_vmemmap_restore_folios(h, list);
> +
> list_for_each_entry_safe(folio, t_folio, list, lru) {
> update_and_free_hugetlb_folio(h, folio, false);
> cond_resched();
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 147018a504a6..d5e6b6c76dce 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -479,6 +479,14 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
Because it is a void function, I'd like to add a comment here like:
This function only tries to restore a list of folios' vmemmap pages and
does not guarantee that the restoration will succeed after it returns.
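I.e., placed as a block comment above the function (signature as in this patch):

        /*
         * This function only tries to restore a list of folios' vmemmap pages and
         * does not guarantee that the restoration will succeed after it returns.
         */
        void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)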
Thanks.
> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> +{
> + struct folio *folio;
> +
> + list_for_each_entry(folio, folio_list, lru)
> + hugetlb_vmemmap_restore(h, &folio->page);
> +}
> +
> /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
> static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
> {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 036494e040ca..b7074672ceb2 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -12,6 +12,7 @@
>
> #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list);
> void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>
> @@ -44,6 +45,11 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
> return 0;
> }
>
> +static inline void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> +{
> + return 0;
> +}
> +
> static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> {
> }
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap
2023-08-25 19:04 ` [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
2023-08-26 8:01 ` kernel test robot
@ 2023-08-30 8:47 ` Muchun Song
1 sibling, 0 replies; 36+ messages in thread
From: Muchun Song @ 2023-08-30 8:47 UTC (permalink / raw)
To: Mike Kravetz
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, linux-mm, linux-kernel
On 2023/8/26 03:04, Mike Kravetz wrote:
> Update the hugetlb_vmemmap_restore path to take a 'batch' parameter that
> indicates restoration is happening on a batch of pages. When set, use
> the existing mechanism (VMEMMAP_REMAP_BULK_PAGES) to delay TLB flushing.
> The routine hugetlb_vmemmap_restore_folios is the only user of this new
> batch parameter and it will perform a global flush after all vmemmap is
> restored.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++--------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index a2fc7b03ac6b..d6e7440b9507 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -479,17 +479,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> * @end: end address of the vmemmap virtual address range that we want to
> * remap.
> * @reuse: reuse address.
> + * @bulk: bulk operation, batch TLB flushes
> *
> * Return: %0 on success, negative error code otherwise.
> */
> static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> - unsigned long reuse)
> + unsigned long reuse, bool bulk)
I'd like to let vmemmap_remap_alloc take VMEMMAP_REMAP_BULK_PAGES directly;
in that case, we would not need to change this function again if we
introduce another flag in the future. I mean changing "bool bulk" to
"unsigned long flags".
> {
> LIST_HEAD(vmemmap_pages);
> struct vmemmap_remap_walk walk = {
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> + .flags = !bulk ? 0 : VMEMMAP_REMAP_BULK_PAGES,
> };
>
> /* See the comment in the vmemmap_remap_free(). */
> @@ -511,17 +513,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
> static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
> core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
>
> -/**
> - * hugetlb_vmemmap_restore - restore previously optimized (by
> - * hugetlb_vmemmap_optimize()) vmemmap pages which
> - * will be reallocated and remapped.
> - * @h: struct hstate.
> - * @head: the head page whose vmemmap pages will be restored.
> - *
> - * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> - * negative error code otherwise.
> - */
> -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, bool bulk)
The same applies here.
> {
> int ret;
> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> @@ -541,7 +533,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> * When a HugeTLB page is freed to the buddy allocator, previously
> * discarded vmemmap pages must be allocated and remapping.
> */
> - ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
> + ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk);
> if (!ret) {
> ClearHPageVmemmapOptimized(head);
> static_branch_dec(&hugetlb_optimize_vmemmap_key);
> @@ -550,12 +542,29 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
> +/**
> + * hugetlb_vmemmap_restore - restore previously optimized (by
> + * hugetlb_vmemmap_optimize()) vmemmap pages which
> + * will be reallocated and remapped.
> + * @h: struct hstate.
> + * @head: the head page whose vmemmap pages will be restored.
> + *
> + * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> + * negative error code otherwise.
> + */
> +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +{
> + return __hugetlb_vmemmap_restore(h, head, false);
> +}
> +
> void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
>
> list_for_each_entry(folio, folio_list, lru)
> - hugetlb_vmemmap_restore(h, &folio->page);
> + (void)__hugetlb_vmemmap_restore(h, &folio->page, true);
Pass VMEMMAP_REMAP_BULK_PAGES directly here.
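That is, something like the following sketch (assuming
__hugetlb_vmemmap_restore() takes the flags word as suggested above):

        list_for_each_entry(folio, folio_list, lru)
                (void)__hugetlb_vmemmap_restore(h, &folio->page,
                                                VMEMMAP_REMAP_BULK_PAGES);

        flush_tlb_kernel_range(0, TLB_FLUSH_ALL);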
Thanks.
> +
> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> }
>
> /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-30 8:09 ` Muchun Song
@ 2023-08-30 11:13 ` Joao Martins
2023-08-30 16:03 ` Joao Martins
0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2023-08-30 11:13 UTC (permalink / raw)
To: Muchun Song, Mike Kravetz
Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
linux-mm, linux-kernel
On 30/08/2023 09:09, Muchun Song wrote:
>
>
> On 2023/8/26 03:04, Mike Kravetz wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> In an effort to minimize the number of TLB flushes, batch all PMD splits
>> belonging to a range of pages in order to perform only 1 (global) TLB
>> flush. This brings a 1T hugetlb allocation down from 14.2 secs to
>> 7.9 secs.
>>
>> Rebased by Mike Kravetz
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> mm/hugetlb_vmemmap.c | 94 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 90 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 500a118915ff..904a64fe5669 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -26,6 +26,7 @@
>> * @reuse_addr: the virtual address of the @reuse_page page.
>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>> * or is mapped from.
>> + * @flags: used to modify behavior in bulk operations
>> */
>> struct vmemmap_remap_walk {
>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>> @@ -34,9 +35,11 @@ struct vmemmap_remap_walk {
>> struct page *reuse_page;
>> unsigned long reuse_addr;
>> struct list_head *vmemmap_pages;
>> +#define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
>> + unsigned long flags;
>> };
>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool bulk)
>> {
>> pmd_t __pmd;
>> int i;
>> @@ -79,7 +82,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long
>> start)
>> /* Make pte visible before pmd. See comment in pmd_install(). */
>> smp_wmb();
>> pmd_populate_kernel(&init_mm, pmd, pgtable);
>> - flush_tlb_kernel_range(start, start + PMD_SIZE);
>> + if (!bulk)
>> + flush_tlb_kernel_range(start, start + PMD_SIZE);
>
> A little weird to me. @bulk is used to indicate whether the TLB
> should be flushed, however, the flag name is VMEMMAP_REMAP_ONLY_SPLIT,
> it seems to tell me @bulk (calculated from walk->flags & VMEMMAP_REMAP_ONLY_SPLIT)
@bulk here means the PMDs are being split in bulk, not the bulk of the
vmemmap pages. But yeah, it's weird; I should clean this up.
> is an indicator to only split the huge pmd entry. I think
> it is better to introduce another flag like VMEMMAP_SPLIT_WITHOUT_FLUSH
> to indicate whether the TLB should be flushed.
>
Based on the feedback on another patch, I think it's better, as you say, to
introduce a VMEMMAP_NO_TLB_FLUSH flag and use walk::remap_pte to tell
whether it's a PMD split or a PTE remap.
>> } else {
>> pte_free_kernel(&init_mm, pgtable);
>> }
>> @@ -119,18 +123,28 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long
>> addr,
>> unsigned long end,
>> struct vmemmap_remap_walk *walk)
>> {
>> + bool bulk;
>> pmd_t *pmd;
>> unsigned long next;
>> + bulk = walk->flags & VMEMMAP_REMAP_ONLY_SPLIT;
>> pmd = pmd_offset(pud, addr);
>> do {
>> int ret;
>> - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>> + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, bulk);
>> if (ret)
>> return ret;
>> next = pmd_addr_end(addr, end);
>> +
>> + /*
>> + * We are only splitting, not remapping the hugetlb vmemmap
>> + * pages.
>> + */
>> + if (bulk)
>> + continue;
>
> Actually, we do not need a flag to detect this situation, you could
> use "!@walk->remap_pte" to determine whether we should go into the
> next level traversal of the page table. ->remap_pte is used to traverse
> the pte entry, so it makes sense to continue to the next pmd entry if
> it is NULL.
>
Yeap, great suggestion.
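So the loop body of vmemmap_pmd_range() could become roughly (untested
sketch):

        next = pmd_addr_end(addr, end);

        /*
         * ->remap_pte walks the pte level; when it is NULL we were only
         * asked to split the pmd, so skip to the next pmd entry.
         */
        if (!walk->remap_pte)
                continue;

        vmemmap_pte_range(pmd, addr, next, walk);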
>> +
>> vmemmap_pte_range(pmd, addr, next, walk);
>> } while (pmd++, addr = next, addr != end);
>> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start,
>> unsigned long end,
>> return ret;
>> } while (pgd++, addr = next, addr != end);
>> - flush_tlb_kernel_range(start, end);
>> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
>> + flush_tlb_kernel_range(start, end);
>
> This could be:
>
> if (walk->remap_pte)
> flush_tlb_kernel_range(start, end);
>
Yeap.
>> return 0;
>> }
>> @@ -296,6 +311,48 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long
>> addr,
>> set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>> }
>> +/**
>> + * vmemmap_remap_split - split the PMDs backing the vmemmap virtual
>> + * address range [@start, @end) into PTEs
>> + * @start: start address of the vmemmap virtual address range that we want
>> + * to remap.
>> + * @end: end address of the vmemmap virtual address range that we want to
>> + * remap.
>> + * @reuse: reuse address.
>> + *
>> + * Return: %0 on success, negative error code otherwise.
>> + */
>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>> + unsigned long reuse)
>> +{
>> + int ret;
>> + LIST_HEAD(vmemmap_pages);
>
> Unused variable?
>
Yeap, a leftover from something else.
>> + struct vmemmap_remap_walk walk = {
>> + .flags = VMEMMAP_REMAP_ONLY_SPLIT,
>> + };
>> +
>> + /*
>> + * In order to make remapping routine most efficient for the huge pages,
>> + * the routine of vmemmap page table walking has the following rules
>> + * (see more details from the vmemmap_pte_range()):
>> + *
>> + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
>> + * should be continuous.
>> + * - The @reuse address is part of the range [@reuse, @end) that we are
>> + * walking which is passed to vmemmap_remap_range().
>> + * - The @reuse address is the first in the complete range.
>> + *
>> + * So we need to make sure that @start and @reuse meet the above rules.
>> + */
>
> The comments are duplicated, something like:
>
> /* See the comment in the vmemmap_remap_free(). */
>
> is enough.
>
OK.
>> + BUG_ON(start - reuse != PAGE_SIZE);
>> +
>> + mmap_read_lock(&init_mm);
>> + ret = vmemmap_remap_range(reuse, end, &walk);
>> + mmap_read_unlock(&init_mm);
>> +
>> + return ret;
>> +}
>> +
>> /**
>> * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>> * to the page which @reuse is mapped to, then free vmemmap
>> @@ -320,6 +377,7 @@ static int vmemmap_remap_free(unsigned long start,
>> unsigned long end,
>> .remap_pte = vmemmap_remap_pte,
>> .reuse_addr = reuse,
>> .vmemmap_pages = &vmemmap_pages,
>> + .flags = 0,
>> };
>> int nid = page_to_nid((struct page *)start);
>> gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
>> @@ -606,11 +664,39 @@ void hugetlb_vmemmap_optimize_bulk(const struct hstate
>> *h, struct page *head,
>> __hugetlb_vmemmap_optimize(h, head, bulk_pages);
>> }
>> +void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>> +{
>> + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>> + unsigned long vmemmap_reuse;
>> +
>> + if (!vmemmap_should_optimize(h, head))
>> + return;
>> +
>> + static_branch_inc(&hugetlb_optimize_vmemmap_key);
>
> Why? hugetlb_optimize_vmemmap_key is used as a switch to let
> page_fixed_fake_head work properly for the vmemmap-optimized
> HugeTLB pages, however, this function only splits the huge pmd
> entry without optimizing the vmemmap pages. So it is wrong to
> increase the static_key.
>
Yes, you're right. It's wrong; it was a copy-and-paste error from
remap_free, and I failed to remove the logic that is specific to the
actual vmemmap optimization.
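So the split helper should reduce to roughly this (untested sketch, with
the static key manipulation and the stale comment dropped):

void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
{
        unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
        unsigned long vmemmap_reuse;

        if (!vmemmap_should_optimize(h, head))
                return;

        vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
        vmemmap_reuse = vmemmap_start;
        vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;

        /*
         * Only split the PMDs backing [@vmemmap_start, @vmemmap_end);
         * nothing is remapped or freed here, so there is no static key
         * or flag accounting to do.
         */
        (void)vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
}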
> Thanks.
>
>> +
>> + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>> + vmemmap_reuse = vmemmap_start;
>> + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
>> +
>> + /*
>> + * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
>> + * to the page which @vmemmap_reuse is mapped to, then free the pages
>> + * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
>> + */
>> + if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
>> + static_branch_dec(&hugetlb_optimize_vmemmap_key);
>> +}
>> +
>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>> *folio_list)
>> {
>> struct folio *folio;
>> LIST_HEAD(vmemmap_pages);
>> + list_for_each_entry(folio, folio_list, lru)
>> + hugetlb_vmemmap_split(h, &folio->page);
>> +
>> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
>> +
>> list_for_each_entry(folio, folio_list, lru)
>> hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap
2023-08-30 8:23 ` Muchun Song
@ 2023-08-30 11:17 ` Joao Martins
0 siblings, 0 replies; 36+ messages in thread
From: Joao Martins @ 2023-08-30 11:17 UTC (permalink / raw)
To: Muchun Song, Mike Kravetz
Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
linux-mm, linux-kernel
On 30/08/2023 09:23, Muchun Song wrote:
>
>
> On 2023/8/26 03:04, Mike Kravetz wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> Now that a list of pages is deduplicated at once, the TLB
>> flush can be batched for all vmemmap pages that got remapped.
>>
>> Add a flags field and pass down whether this is a bulk operation or
>> just a single page, so the remap code can decide whether to flush.
>>
>> The TLB flush is global because the caller gives no guarantee that the
>> set of folios is contiguous, and to avoid the complexity of composing a
>> list of kVAs to flush.
>>
>> Modified by Mike Kravetz to perform TLB flush on single folio if an
>> error is encountered.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> mm/hugetlb_vmemmap.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 904a64fe5669..a2fc7b03ac6b 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -36,6 +36,7 @@ struct vmemmap_remap_walk {
>> unsigned long reuse_addr;
>> struct list_head *vmemmap_pages;
>> #define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
>> +#define VMEMMAP_REMAP_BULK_PAGES BIT(1)
>
> We could reuse the flag proposed in patch 10 (which I suggested naming
> VMEMMAP_SPLIT_WITHOUT_FLUSH). Having seen this patch, though, I think that
> name is not suitable; maybe VMEMMAP_WITHOUT_TLB_FLUSH is better.
>
As mentioned on the previous patch, yeah, it makes sense to have a bit just
for "no TLB flush", and perhaps we don't even need BIT(1). We can use
remap_pte to tell whether it is the PTE or the PMD flush that is being
skipped.
> Thanks.
>
>> unsigned long flags;
>> };
>> @@ -211,7 +212,8 @@ static int vmemmap_remap_range(unsigned long start,
>> unsigned long end,
>> return ret;
>> } while (pgd++, addr = next, addr != end);
>> - if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
>> + if (!(walk->flags &
>> + (VMEMMAP_REMAP_ONLY_SPLIT | VMEMMAP_REMAP_BULK_PAGES)))
>> flush_tlb_kernel_range(start, end);
>> return 0;
>> @@ -377,7 +379,7 @@ static int vmemmap_remap_free(unsigned long start,
>> unsigned long end,
>> .remap_pte = vmemmap_remap_pte,
>> .reuse_addr = reuse,
>> .vmemmap_pages = &vmemmap_pages,
>> - .flags = 0,
>> + .flags = !bulk_pages ? 0 : VMEMMAP_REMAP_BULK_PAGES,
>> };
>> int nid = page_to_nid((struct page *)start);
>> gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
>> @@ -427,6 +429,7 @@ static int vmemmap_remap_free(unsigned long start,
>> unsigned long end,
>> .remap_pte = vmemmap_restore_pte,
>> .reuse_addr = reuse,
>> .vmemmap_pages = &vmemmap_pages,
>> + .flags = 0,
>> };
>> vmemmap_remap_range(reuse, end, &walk);
>> @@ -700,6 +703,8 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
>> struct list_head *folio_l
>> list_for_each_entry(folio, folio_list, lru)
>> hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
>> +
>> free_vmemmap_page_list(&vmemmap_pages);
>> }
>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-30 11:13 ` Joao Martins
@ 2023-08-30 16:03 ` Joao Martins
2023-08-31 3:54 ` Muchun Song
0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2023-08-30 16:03 UTC (permalink / raw)
To: Muchun Song, Mike Kravetz
Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Miaohe Lin,
David Rientjes, Anshuman Khandual, Naoya Horiguchi, Barry Song,
Michal Hocko, Matthew Wilcox, Xiongchun Duan, Andrew Morton,
linux-mm, linux-kernel
On 30/08/2023 12:13, Joao Martins wrote:
> On 30/08/2023 09:09, Muchun Song wrote:
>> On 2023/8/26 03:04, Mike Kravetz wrote:
>>> +
>>> + /*
>>> + * We are only splitting, not remapping the hugetlb vmemmap
>>> + * pages.
>>> + */
>>> + if (bulk)
>>> + continue;
>>
>> Actually, we do not need a flag to detect this situation, you could
>> use "!@walk->remap_pte" to determine whether we should go into the
>> next level traversal of the page table. ->remap_pte is used to traverse
>> the pte entry, so it makes sense to continue to the next pmd entry if
>> it is NULL.
>>
>
> Yeap, great suggestion.
>
>>> +
>>> vmemmap_pte_range(pmd, addr, next, walk);
>>> } while (pmd++, addr = next, addr != end);
>>> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start,
>>> unsigned long end,
>>> return ret;
>>> } while (pgd++, addr = next, addr != end);
>>> - flush_tlb_kernel_range(start, end);
>>> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
>>> + flush_tlb_kernel_range(start, end);
>>
>> This could be:
>>
>> if (walk->remap_pte)
>> flush_tlb_kernel_range(start, end);
>>
> Yeap.
>
Quick correction: this stays as is, except with a flag rename. That is
because this is the actual flush that we intend to batch in the next patch.
And while the PMD split could just use !walk->remap_pte, the next patch
would still need to test the NO_TLB_FLUSH flag, meaning we end up testing
for this to-be-consolidated flag anyway.
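That is, the check in vmemmap_remap_range() would end up as (sketch, with
the flag renamed as discussed):

        if (!(walk->flags & VMEMMAP_NO_TLB_FLUSH))
                flush_tlb_kernel_range(start, end);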
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 07/12] hugetlb: perform vmemmap restoration on a list of pages
2023-08-30 8:33 ` Muchun Song
@ 2023-08-30 17:53 ` Mike Kravetz
0 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-30 17:53 UTC (permalink / raw)
To: Muchun Song
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, linux-mm, linux-kernel
On 08/30/23 16:33, Muchun Song wrote:
>
>
> On 2023/8/26 03:04, Mike Kravetz wrote:
> > When removing hugetlb pages from the pool, we first create a list
> > of removed pages and then free those pages back to low level allocators.
> > Part of the 'freeing process' is to restore vmemmap for all base pages
> > if necessary. Pass this list of pages to a new routine
> > hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
> > performed in bulk.
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> > mm/hugetlb.c | 3 +++
> > mm/hugetlb_vmemmap.c | 8 ++++++++
> > mm/hugetlb_vmemmap.h | 6 ++++++
> > 3 files changed, 17 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3133dbd89696..1bde5e234d5c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1833,6 +1833,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > {
> > struct folio *folio, *t_folio;
> > + /* First restore vmemmap for all pages on list. */
> > + hugetlb_vmemmap_restore_folios(h, list);
> > +
> > list_for_each_entry_safe(folio, t_folio, list, lru) {
> > update_and_free_hugetlb_folio(h, folio, false);
> > cond_resched();
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 147018a504a6..d5e6b6c76dce 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -479,6 +479,14 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > return ret;
> > }
>
> Because it is a void function, I'd like to add a comment here like:
>
> This function only tries to restore a list of folios' vmemmap pages and
> does not guarantee that the restoration will succeed after it returns.
Will do. Thanks!
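Perhaps along these lines (wording still to be polished):

/*
 * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
 * @h: struct hstate.
 * @folio_list: list of folios.
 *
 * This function only tries to restore each folio's vmemmap pages and does
 * not guarantee that the restoration will succeed after it returns.
 */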
--
Mike Kravetz
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 08/12] hugetlb: batch freeing of vmemmap pages
2023-08-30 7:20 ` Muchun Song
@ 2023-08-30 18:36 ` Mike Kravetz
0 siblings, 0 replies; 36+ messages in thread
From: Mike Kravetz @ 2023-08-30 18:36 UTC (permalink / raw)
To: Muchun Song
Cc: Muchun Song, Joao Martins, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, linux-kernel, linux-mm
On 08/30/23 15:20, Muchun Song wrote:
>
>
> On 2023/8/26 03:04, Mike Kravetz wrote:
> > Now that batching of hugetlb vmemmap optimization processing is possible,
> > batch the freeing of vmemmap pages. When freeing vmemmap pages for a
> > hugetlb page, we add them to a list that is freed after the entire batch
> > has been processed.
> >
> > This enhances the ability to return contiguous ranges of memory to the
> > low level allocators.
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> > mm/hugetlb_vmemmap.c | 56 ++++++++++++++++++++++++++++++++------------
> > 1 file changed, 41 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index d5e6b6c76dce..e390170c0887 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -305,11 +305,14 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > * @end: end address of the vmemmap virtual address range that we want to
> > * remap.
> > * @reuse: reuse address.
> > + * @bulk_pages: list to deposit vmemmap pages to be freed in bulk operations
> > + * or NULL in non-bulk case;
>
> I'd like to rename bulk_pages to vmemmap_pages. Always add the vmemmap
> pages to this list and let the caller (hugetlb_vmemmap_optimize and
> hugetlb_vmemmap_optimize_folios) help us free them. That would be
> clearer to me.
Makes sense.
I will update all these in the next version based on your suggestions.
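To make sure I understand the direction, the end result would look roughly
like this (sketch only):

void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
        LIST_HEAD(vmemmap_pages);

        __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
        free_vmemmap_page_list(&vmemmap_pages);
}

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
        struct folio *folio;
        LIST_HEAD(vmemmap_pages);

        list_for_each_entry(folio, folio_list, lru)
                __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);

        free_vmemmap_page_list(&vmemmap_pages);
}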
Thank you,
--
Mike Kravetz
>
> > *
> > * Return: %0 on success, negative error code otherwise.
> > */
> > static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > - unsigned long reuse)
> > + unsigned long reuse,
> > + struct list_head *bulk_pages)
> > {
> > int ret;
> > LIST_HEAD(vmemmap_pages);
> > @@ -372,7 +375,14 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > }
> > mmap_read_unlock(&init_mm);
> > - free_vmemmap_page_list(&vmemmap_pages);
> > + /*
> > + * if performing bulk operation, do not free pages here.
> > + * rather add them to the bulk list
> > + */
> > + if (!bulk_pages)
> > + free_vmemmap_page_list(&vmemmap_pages);
> > + else
> > + list_splice(&vmemmap_pages, bulk_pages);
>
> Here, always add vmemmap_pages to the list.
>
> > return ret;
> > }
> > @@ -546,17 +556,9 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> > return true;
> > }
> > -/**
> > - * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> > - * @h: struct hstate.
> > - * @head: the head page whose vmemmap pages will be optimized.
> > - *
> > - * This function only tries to optimize @head's vmemmap pages and does not
> > - * guarantee that the optimization will succeed after it returns. The caller
> > - * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> > - * have been optimized.
> > - */
> > -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> > +static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> > + struct page *head,
> > + struct list_head *bulk_pages)
>
> Also struct list_head *vmemmap_pages.
>
> > {
> > unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> > unsigned long vmemmap_reuse;
> > @@ -575,18 +577,42 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> > * to the page which @vmemmap_reuse is mapped to, then free the pages
> > * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> > */
> > - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
> > + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
> > static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > else
> > SetHPageVmemmapOptimized(head);
> > }
> > +/**
> > + * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> > + * @h: struct hstate.
> > + * @head: the head page whose vmemmap pages will be optimized.
> > + *
> > + * This function only tries to optimize @head's vmemmap pages and does not
> > + * guarantee that the optimization will succeed after it returns. The caller
> > + * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> > + * have been optimized.
> > + */
> > +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> > +{
> > + __hugetlb_vmemmap_optimize(h, head, NULL);
>
> Use free_vmemmap_page_list to free vmemmap pages here.
>
> > +}
> > +
> > +void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
> > + struct list_head *bulk_pages)
> > +{
> > + __hugetlb_vmemmap_optimize(h, head, bulk_pages);
> > +}
> > +
> > void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> > {
> > struct folio *folio;
> > + LIST_HEAD(vmemmap_pages);
> > list_for_each_entry(folio, folio_list, lru)
> > - hugetlb_vmemmap_optimize(h, &folio->page);
> > + hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>
> Directly use __hugetlb_vmemmap_optimize and delete
> hugetlb_vmemmap_optimize_bulk. In the future, we could rename
> hugetlb_vmemmap_optimize to hugetlb_vmemmap_optimize_folio; then both
> function names would be consistent, e.g.:
>
> 1) hugetlb_vmemmap_optimize_folio(): used to free one folio's vmemmap
>    pages.
> 2) hugetlb_vmemmap_optimize_folios(): used to free multiple folios'
>    vmemmap pages.
>
> Thanks.
>
> > +
> > + free_vmemmap_page_list(&vmemmap_pages);
> > }
> > static struct ctl_table hugetlb_vmemmap_sysctls[] = {
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag
2023-08-30 7:26 ` Muchun Song
@ 2023-08-30 22:47 ` Mike Kravetz
2023-08-31 3:27 ` Muchun Song
0 siblings, 1 reply; 36+ messages in thread
From: Mike Kravetz @ 2023-08-30 22:47 UTC (permalink / raw)
To: Muchun Song
Cc: linux-mm, linux-kernel, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
Xiongchun Duan, Andrew Morton
On 08/30/23 15:26, Muchun Song wrote:
>
>
> On 2023/8/26 03:04, Mike Kravetz wrote:
> > At the beginning of hugetlb_vmemmap_optimize, optimistically set
> > the HPageVmemmapOptimized flag in the head page. Clear the flag
> > if the operation fails.
> >
> > No change in behavior. However, this will become important in
> > subsequent patches where we batch and delay TLB flushing. We need to
> > make sure the content in the old and new vmemmap pages are the same.
>
> Sorry, I didn't get the point here. Could you elaborate?
>
Sorry, this really could use a better explanation.
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> > mm/hugetlb_vmemmap.c | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index e390170c0887..500a118915ff 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -566,7 +566,9 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> > if (!vmemmap_should_optimize(h, head))
> > return;
> > + /* Optimistically assume success */
> > static_branch_inc(&hugetlb_optimize_vmemmap_key);
> > + SetHPageVmemmapOptimized(head);
> > vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> > vmemmap_reuse = vmemmap_start;
> > @@ -577,10 +579,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
> > * to the page which @vmemmap_reuse is mapped to, then free the pages
> > * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> > */
> > - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
> > + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages)) {
> > static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > - else
> > - SetHPageVmemmapOptimized(head);
> > + ClearHPageVmemmapOptimized(head);
> > + }
Consider the case where we have successfully remapped vmemmap AND
- we have replaced the vmemmap page containing the struct page of the
hugetlb head page, as done in Joao's commit 11aad2631bf7
'mm/hugetlb_vmemmap: remap head page to newly allocated page'.
- we have NOT flushed the TLB after remapping, due to batching the
operations before the flush.
In this case, it is possible that the old head vmemmap page is still in the
TLB and caches, and SetHPageVmemmapOptimized(head) will actually set the
flag in that old page. We then have an optimized hugetlb page without the
HPageVmemmapOptimized flag set. When developing this series, we
experienced various BUGs as a result of this situation.
In the case of an error during optimization, we do a TLB flush, so if
we need to clear the flag we will write to the correct (new) vmemmap page.
Hope that makes sense.
I will add an explanation like this to the commit message, and perhaps move
this closer to, or squash it with, the patch that batches operations before
flushing the TLB.
--
Mike Kravetz
> > }
> > /**
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag
2023-08-30 22:47 ` Mike Kravetz
@ 2023-08-31 3:27 ` Muchun Song
0 siblings, 0 replies; 36+ messages in thread
From: Muchun Song @ 2023-08-31 3:27 UTC (permalink / raw)
To: Mike Kravetz
Cc: Linux-MM, LKML, Muchun Song, Joao Martins, Oscar Salvador,
David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton
> On Aug 31, 2023, at 06:47, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 08/30/23 15:26, Muchun Song wrote:
>>
>>
>> On 2023/8/26 03:04, Mike Kravetz wrote:
>>> At the beginning of hugetlb_vmemmap_optimize, optimistically set
>>> the HPageVmemmapOptimized flag in the head page. Clear the flag
>>> if the operation fails.
>>>
>>> No change in behavior. However, this will become important in
>>> subsequent patches where we batch and delay TLB flushing. We need to
>>> make sure the content in the old and new vmemmap pages are the same.
>>
>> Sorry, I didn't get the point here. Could you elaborate?
>>
>
> Sorry, this really could use a better explanation.
>
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 8 +++++---
>>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index e390170c0887..500a118915ff 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -566,7 +566,9 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
>>> if (!vmemmap_should_optimize(h, head))
>>> return;
>>> + /* Optimistically assume success */
>>> static_branch_inc(&hugetlb_optimize_vmemmap_key);
>>> + SetHPageVmemmapOptimized(head);
>>> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>>> vmemmap_reuse = vmemmap_start;
>>> @@ -577,10 +579,10 @@ static void __hugetlb_vmemmap_optimize(const struct hstate *h,
>>> * to the page which @vmemmap_reuse is mapped to, then free the pages
>>> * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
>>> */
>>> - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages))
>>> + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, bulk_pages)) {
>>> static_branch_dec(&hugetlb_optimize_vmemmap_key);
>>> - else
>>> - SetHPageVmemmapOptimized(head);
>>> + ClearHPageVmemmapOptimized(head);
>>> + }
>
> Consider the case where we have successfully remapped vmemmap AND
> - we have replaced the vmemmap page containing the struct page of the
> hugetlb head page, as done in Joao's commit 11aad2631bf7
> 'mm/hugetlb_vmemmap: remap head page to newly allocated page'.
> - we have NOT flushed the TLB after remapping, due to batching the
> operations before the flush.
>
> In this case, it is possible that the old head vmemmap page is still in the
> TLB and caches, and SetHPageVmemmapOptimized(head) will actually set the
> flag in that old page. We then have an optimized hugetlb page without the
> HPageVmemmapOptimized flag set. When developing this series, we
> experienced various BUGs as a result of this situation.
Now, I got it. Thanks for your elaboration.
>
> In the case of an error during optimization, we do a TLB flush, so if
> we need to clear the flag we will write to the correct (new) vmemmap page.
Right.
>
> Hope that makes sense.
>
> I will add an explanation like this to the commit message, and perhaps move
> this closer to, or squash it with, the patch that batches operations before
> flushing the TLB.
Yes. But I'd also like to add a big comment explaining what's going on here
instead of a simple "Optimistically assume success". That comment alone
really makes it look like an optimization, not a mandatory premise.
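Something like this, for example (the wording is just a sketch based on
your explanation above):

        /*
         * When the TLB flush is deferred (batched) until after the whole
         * list has been remapped, a stale TLB entry may still map the old
         * vmemmap page of @head after the remap. Setting the flag after
         * the remap could therefore land in the old, about-to-be-discarded
         * page. So take the static key reference and set the flag up
         * front, and undo both on failure, where a TLB flush has been done
         * and the write is guaranteed to hit the live vmemmap page.
         */
        static_branch_inc(&hugetlb_optimize_vmemmap_key);
        SetHPageVmemmapOptimized(head);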
Thanks.
> --
> Mike Kravetz
>
>>> }
>>> /**
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-30 16:03 ` Joao Martins
@ 2023-08-31 3:54 ` Muchun Song
2023-08-31 9:26 ` Joao Martins
0 siblings, 1 reply; 36+ messages in thread
From: Muchun Song @ 2023-08-31 3:54 UTC (permalink / raw)
To: Joao Martins
Cc: Mike Kravetz, Muchun Song, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Linux-MM, LKML
> On Aug 31, 2023, at 00:03, Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 30/08/2023 12:13, Joao Martins wrote:
>> On 30/08/2023 09:09, Muchun Song wrote:
>>> On 2023/8/26 03:04, Mike Kravetz wrote:
>>>> +
>>>> + /*
>>>> + * We are only splitting, not remapping the hugetlb vmemmap
>>>> + * pages.
>>>> + */
>>>> + if (bulk)
>>>> + continue;
>>>
>>> Actually, we do not need a flag to detect this situation, you could
>>> use "!@walk->remap_pte" to determine whether we should go into the
>>> next level traversal of the page table. ->remap_pte is used to traverse
>>> the pte entry, so it makes sense to continue to the next pmd entry if
>>> it is NULL.
>>>
>>
>> Yeap, great suggestion.
>>
>>>> +
>>>> vmemmap_pte_range(pmd, addr, next, walk);
>>>> } while (pmd++, addr = next, addr != end);
>>>> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start,
>>>> unsigned long end,
>>>> return ret;
>>>> } while (pgd++, addr = next, addr != end);
>>>> - flush_tlb_kernel_range(start, end);
>>>> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
>>>> + flush_tlb_kernel_range(start, end);
>>>
>>> This could be:
>>>
>>> if (walk->remap_pte)
>>> flush_tlb_kernel_range(start, end);
>>>
>> Yeap.
>>
>
> Quick correction: this stays as is, except with a flag rename. That is
> because this is the actual flush that we intend to batch in the next patch.
> And while the PMD split could just use !walk->remap_pte, the next patch
> would still need to test the NO_TLB_FLUSH flag, meaning we end up testing
> for this to-be-consolidated flag anyway.
I think this really should be "if (walk->remap_pte && !(flag & VMEMMAP_NO_TLB_FLUSH))"
in your next patch. This TLB flush only makes sense when @walk->remap_pte
exists. I know the "if (!(flag & VMEMMAP_NO_TLB_FLUSH))" check is suitable
for your use case, but what if a user (even one that does not exist now but
may in the future) passes a NULL @walk->remap_pte and does not specify
VMEMMAP_NO_TLB_FLUSH? Then we would do a useless TLB flush. This is why I
suggest you change this to "if (walk->remap_pte)" in this patch and change
it to "if (walk->remap_pte && !(flag & VMEMMAP_NO_TLB_FLUSH))" in the next
patch.
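That is, the end state of the check in vmemmap_remap_range() would be:

        if (walk->remap_pte && !(walk->flags & VMEMMAP_NO_TLB_FLUSH))
                flush_tlb_kernel_range(start, end);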
Thanks.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
2023-08-31 3:54 ` Muchun Song
@ 2023-08-31 9:26 ` Joao Martins
0 siblings, 0 replies; 36+ messages in thread
From: Joao Martins @ 2023-08-31 9:26 UTC (permalink / raw)
To: Muchun Song
Cc: Mike Kravetz, Muchun Song, Oscar Salvador, David Hildenbrand,
Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
Barry Song, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
Andrew Morton, Linux-MM, LKML
On 31/08/2023 04:54, Muchun Song wrote:
>
>
>> On Aug 31, 2023, at 00:03, Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 30/08/2023 12:13, Joao Martins wrote:
>>> On 30/08/2023 09:09, Muchun Song wrote:
>>>> On 2023/8/26 03:04, Mike Kravetz wrote:
>>>>> +
>>>>> + /*
>>>>> + * We are only splitting, not remapping the hugetlb vmemmap
>>>>> + * pages.
>>>>> + */
>>>>> + if (bulk)
>>>>> + continue;
>>>>
>>>> Actually, we do not need a flag to detect this situation, you could
>>>> use "!@walk->remap_pte" to determine whether we should go into the
>>>> next level traversal of the page table. ->remap_pte is used to traverse
>>>> the pte entry, so it makes sense to continue to the next pmd entry if
>>>> it is NULL.
>>>>
>>>
>>> Yeap, great suggestion.
>>>
>>>>> +
>>>>> vmemmap_pte_range(pmd, addr, next, walk);
>>>>> } while (pmd++, addr = next, addr != end);
>>>>> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start,
>>>>> unsigned long end,
>>>>> return ret;
>>>>> } while (pgd++, addr = next, addr != end);
>>>>> - flush_tlb_kernel_range(start, end);
>>>>> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
>>>>> + flush_tlb_kernel_range(start, end);
>>>>
>>>> This could be:
>>>>
>>>> if (walk->remap_pte)
>>>> flush_tlb_kernel_range(start, end);
>>>>
>>> Yeap.
>>>
>>
>> Quick correction: this stays as is, except with a flag rename. That is
>> because this is the actual flush that we intend to batch in the next patch.
>> And while the PMD split could just use !walk->remap_pte, the next patch
>> would still need to test the NO_TLB_FLUSH flag, meaning we end up testing
>> for this to-be-consolidated flag anyway.
>
> I think this really should be "if (walk->remap_pte && !(flag & VMEMMAP_NO_TLB_FLUSH))"
> in your next patch. This TLB flush only makes sense when @walk->remap_pte
> exists. I know the "if (!(flag & VMEMMAP_NO_TLB_FLUSH))" check is suitable
> for your use case, but what if a user (even one that does not exist now but
> may in the future) passes a NULL @walk->remap_pte and does not specify
> VMEMMAP_NO_TLB_FLUSH? Then we would do a useless TLB flush. This is why I
> suggest you change this to "if (walk->remap_pte)" in this patch and change
> it to "if (walk->remap_pte && !(flag & VMEMMAP_NO_TLB_FLUSH))" in the next
> patch.
OK, fair enough.
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread
Thread overview: 36+ messages
2023-08-25 19:04 [PATCH 00/12] Batch hugetlb vmemmap modification operations Mike Kravetz
2023-08-25 19:04 ` [PATCH 01/12] hugetlb: clear flags in tail pages that will be freed individually Mike Kravetz
2023-08-25 19:04 ` [PATCH 02/12] hugetlb: Use a folio in free_hpage_workfn() Mike Kravetz
2023-08-25 19:04 ` [PATCH 03/12] hugetlb: Remove a few calls to page_folio() Mike Kravetz
2023-08-25 19:04 ` [PATCH 04/12] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio() Mike Kravetz
2023-08-25 19:04 ` [PATCH 05/12] hugetlb: restructure pool allocations Mike Kravetz
2023-08-25 19:04 ` [PATCH 06/12] hugetlb: perform vmemmap optimization on a list of pages Mike Kravetz
2023-08-25 19:04 ` [PATCH 07/12] hugetlb: perform vmemmap restoration " Mike Kravetz
2023-08-26 6:58 ` kernel test robot
2023-08-30 8:33 ` Muchun Song
2023-08-30 17:53 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 08/12] hugetlb: batch freeing of vmemmap pages Mike Kravetz
2023-08-26 4:00 ` kernel test robot
2023-08-30 7:20 ` Muchun Song
2023-08-30 18:36 ` Mike Kravetz
2023-08-25 19:04 ` [PATCH 09/12] hugetlb_vmemmap: Optimistically set Optimized flag Mike Kravetz
2023-08-30 7:26 ` Muchun Song
2023-08-30 22:47 ` Mike Kravetz
2023-08-31 3:27 ` Muchun Song
2023-08-25 19:04 ` [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup Mike Kravetz
2023-08-26 5:56 ` kernel test robot
2023-08-28 9:42 ` Joao Martins
2023-08-28 16:44 ` Mike Kravetz
2023-08-29 3:47 ` Muchun Song
2023-08-26 18:14 ` kernel test robot
2023-08-30 8:09 ` Muchun Song
2023-08-30 11:13 ` Joao Martins
2023-08-30 16:03 ` Joao Martins
2023-08-31 3:54 ` Muchun Song
2023-08-31 9:26 ` Joao Martins
2023-08-25 19:04 ` [PATCH 11/12] hugetlb: batch TLB flushes when freeing vmemmap Mike Kravetz
2023-08-30 8:23 ` Muchun Song
2023-08-30 11:17 ` Joao Martins
2023-08-25 19:04 ` [PATCH 12/12] hugetlb: batch TLB flushes when restoring vmemmap Mike Kravetz
2023-08-26 8:01 ` kernel test robot
2023-08-30 8:47 ` Muchun Song