* [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap
2023-07-11 22:09 [PATCH 0/2] Fix hugetlb free path race with memory errors Mike Kravetz
@ 2023-07-11 22:09 ` Mike Kravetz
2023-07-12 8:03 ` Muchun Song
2023-07-11 22:09 ` [PATCH 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
2023-07-13 17:34 ` [PATCH 0/2] Fix hugetlb free path race with memory errors Andrew Morton
2 siblings, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2023-07-11 22:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Jiaqi Yan, Naoya Horiguchi, Muchun Song, Miaohe Lin,
Axel Rasmussen, James Houghton, Michal Hocko, Andrew Morton,
Mike Kravetz, stable
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
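As a rough illustration (not part of the patch; arguments omitted), the
window looks like this:

  freeing CPU                               memory error handling CPU
  -----------                               --------------------------
  spin_lock_irq(&hugetlb_lock)
  remove_hugetlb_folio()
    /* hugetlb dtor cleared */
  spin_unlock_irq(&hugetlb_lock)
                                            page no longer looks like hugetlb,
                                            treated as a normal compound page
                                            SetPageHWPoison(tail page)
                                              /* tail struct page has no vmemmap
                                                 backing -> write fault */
  update_and_free_hugetlb_folio()
    /* allocates vmemmap, but too late */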
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
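In outline (condensed from the diff below), the destructor is now cleared
at one of two points:

  /* __remove_hugetlb_folio(), already holding hugetlb_lock */
  if (!folio_test_hugetlb_vmemmap_optimized(folio))
          __clear_hugetlb_destructor(h, folio);   /* vmemmap present */

  /* __update_and_free_hugetlb_folio(), after vmemmap has been allocated */
  if (clear_dtor) {
          spin_lock_irq(&hugetlb_lock);
          __clear_hugetlb_destructor(h, folio);
          spin_unlock_irq(&hugetlb_lock);
  }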
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 51 insertions(+), 24 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4a28ce0667f..1b67bf341c32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1580,9 +1580,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
unsigned int order) { }
#endif
+static inline void __clear_hugetlb_destructor(struct hstate *h,
+ struct folio *folio)
+{
+ lockdep_assert_held(&hugetlb_lock);
+
+ /*
+ * Very subtle
+ *
+ * For non-gigantic pages set the destructor to the normal compound
+ * page dtor. This is needed in case someone takes an additional
+ * temporary ref to the page, and freeing is delayed until they drop
+ * their reference.
+ *
+ * For gigantic pages set the destructor to the null dtor. This
+ * destructor will never be called. Before freeing the gigantic
+ * page destroy_compound_gigantic_folio will turn the folio into a
+ * simple group of pages. After this the destructor does not
+ * apply.
+ *
+ */
+ if (hstate_is_gigantic(h))
+ folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
+ else
+ folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+}
+
/*
- * Remove hugetlb folio from lists, and update dtor so that the folio appears
- * as just a compound page.
+ * Remove hugetlb folio from lists.
+ * If vmemmap exists for the folio, update dtor so that the folio appears
+ * as just a compound page. Otherwise, wait until after allocating vmemmap
+ * to update dtor.
*
* A reference is held on the folio, except in the case of demote.
*
@@ -1613,31 +1641,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
}
/*
- * Very subtle
- *
- * For non-gigantic pages set the destructor to the normal compound
- * page dtor. This is needed in case someone takes an additional
- * temporary ref to the page, and freeing is delayed until they drop
- * their reference.
- *
- * For gigantic pages set the destructor to the null dtor. This
- * destructor will never be called. Before freeing the gigantic
- * page destroy_compound_gigantic_folio will turn the folio into a
- * simple group of pages. After this the destructor does not
- * apply.
- *
- * This handles the case where more than one ref is held when and
- * after update_and_free_hugetlb_folio is called.
- *
- * In the case of demote we do not ref count the page as it will soon
- * be turned into a page of smaller size.
+ * We can only clear the hugetlb destructor after allocating vmemmap
+ * pages. Otherwise, someone (memory error handling) may try to write
+ * to tail struct pages.
+ */
+ if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ __clear_hugetlb_destructor(h, folio);
+
+ /*
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
if (!demote)
folio_ref_unfreeze(folio, 1);
- if (hstate_is_gigantic(h))
- folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
- else
- folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1706,6 +1722,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
{
int i;
struct page *subpage;
+ bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
@@ -1736,6 +1753,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
if (unlikely(folio_test_hwpoison(folio)))
folio_clear_hugetlb_hwpoison(folio);
+ /*
+ * If vmemmap pages were allocated above, then we need to clear the
+ * hugetlb destructor under the hugetlb lock.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
for (i = 0; i < pages_per_huge_page(h); i++) {
subpage = folio_page(folio, i);
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
--
2.41.0
* Re: [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap
2023-07-11 22:09 ` [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap Mike Kravetz
@ 2023-07-12 8:03 ` Muchun Song
2023-07-12 18:14 ` Mike Kravetz
0 siblings, 1 reply; 7+ messages in thread
From: Muchun Song @ 2023-07-12 8:03 UTC (permalink / raw)
To: Mike Kravetz
Cc: Linux Memory Management List, LKML, Jiaqi Yan, Naoya Horiguchi,
Muchun Song, Miaohe Lin, Axel Rasmussen, James Houghton,
Michal Hocko, Andrew Morton, stable
> On Jul 12, 2023, at 06:09, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
> lists, get a ref on the page and remove hugetlb destructor. This
> all must be done under the hugetlb lock. After this call, the page
> can be treated as a normal compound page or a collection of base
> size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> needed and the free routine of the underlying allocator is called
> on the resulting page. We can not hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps. In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages. It will then try to SetPageHWPoison(page) on the page with an
> error. If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap. Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present. This saves a
> lock/unlock cycle. Otherwise, destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count. This is not
> a normal state. The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation. This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a large number of pages.
>
> Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Hi Mike,
I have seen an issue proposed by Jiaqi Yan in [1]. I didn't see any
resolution for it. Am I missing something with this fix?
[1] https://lore.kernel.org/linux-mm/CACw3F53iPiLrJt4pyaX2aaZ5BVg9tj8x_k6-v7=9Xn1nrh=UCw@mail.gmail.com/
Thanks.
* Re: [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap
2023-07-12 8:03 ` Muchun Song
@ 2023-07-12 18:14 ` Mike Kravetz
2023-07-13 0:22 ` Mike Kravetz
0 siblings, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2023-07-12 18:14 UTC (permalink / raw)
To: Muchun Song
Cc: Linux Memory Management List, LKML, Jiaqi Yan, Naoya Horiguchi,
Muchun Song, Miaohe Lin, Axel Rasmussen, James Houghton,
Michal Hocko, Andrew Morton, stable
On 07/12/23 16:03, Muchun Song wrote:
>
>
> > On Jul 12, 2023, at 06:09, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > Freeing a hugetlb page and releasing base pages back to the underlying
> > allocator such as buddy or cma is performed in two steps:
> > - remove_hugetlb_folio() is called to remove the folio from hugetlb
> > lists, get a ref on the page and remove hugetlb destructor. This
> > all must be done under the hugetlb lock. After this call, the page
> > can be treated as a normal compound page or a collection of base
> > size pages.
> > - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> > needed and the free routine of the underlying allocator is called
> > on the resulting page. We can not hold the hugetlb lock here.
> >
> > One issue with this scheme is that a memory error could occur between
> > these two steps. In this case, the memory error handling code treats
> > the old hugetlb page as a normal compound page or collection of base
> > pages. It will then try to SetPageHWPoison(page) on the page with an
> > error. If the page with error is a tail page without vmemmap, a write
> > error will occur when trying to set the flag.
> >
> > Address this issue by modifying remove_hugetlb_folio() and
> > update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> > cleared until after allocating vmemmap. Since clearing the destructor
> > requires holding the hugetlb lock, the clearing is done in
> > remove_hugetlb_folio() if the vmemmap is present. This saves a
> > lock/unlock cycle. Otherwise, destructor is cleared in
> > update_and_free_hugetlb_folio() after allocating vmemmap.
> >
> > Note that this will leave hugetlb pages in a state where they are marked
> > free (by hugetlb specific page flag) and have a ref count. This is not
> > a normal state. The only code that would notice is the memory error
> > code, and it is set up to retry in such a case.
> >
> > A subsequent patch will create a routine to do bulk processing of
> > vmemmap allocation. This will eliminate a lock/unlock cycle for each
> > hugetlb page in the case where we are freeing a large number of pages.
> >
> > Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> Hi Mike,
>
> I have seen an issue proposed by Jiaqi Yan in [1]. I didn't see any
> resolution for it. Am I missing something with this fix?
>
> [1] https://lore.kernel.org/linux-mm/CACw3F53iPiLrJt4pyaX2aaZ5BVg9tj8x_k6-v7=9Xn1nrh=UCw@mail.gmail.com/
>
My mistake! I sent the old version of the patch.
The new version was modified to simply check the destructor via
folio_test_hugetlb() in order to decide if it should be cleared.
I will send V2. Sorry!
--
Mike Kravetz
* Re: [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap
2023-07-12 18:14 ` Mike Kravetz
@ 2023-07-13 0:22 ` Mike Kravetz
0 siblings, 0 replies; 7+ messages in thread
From: Mike Kravetz @ 2023-07-13 0:22 UTC (permalink / raw)
To: Muchun Song
Cc: Linux Memory Management List, LKML, Jiaqi Yan, Naoya Horiguchi,
Muchun Song, Miaohe Lin, Axel Rasmussen, James Houghton,
Michal Hocko, Andrew Morton, stable
On 07/12/23 11:14, Mike Kravetz wrote:
> On 07/12/23 16:03, Muchun Song wrote:
> > > On Jul 12, 2023, at 06:09, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > Note that this will leave hugetlb pages in a state where they are marked
> > > free (by hugetlb specific page flag) and have a ref count. This is not
> > > a normal state. The only code that would notice is the memory error
> > > code, and it is set up to retry in such a case.
> > >
> > > A subsequent patch will create a routine to do bulk processing of
> > > vmemmap allocation. This will eliminate a lock/unlock cycle for each
> > > hugetlb page in the case where we are freeing a large number of pages.
> > >
> > > Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> > > Cc: <stable@vger.kernel.org>
> > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >
> > Hi Mike,
> >
> > I have seen an issue proposed by Jiaqi Yan in [1]. I didn't see any
> > resolution for it. Am I missing something with this fix?
> >
> > [1] https://lore.kernel.org/linux-mm/CACw3F53iPiLrJt4pyaX2aaZ5BVg9tj8x_k6-v7=9Xn1nrh=UCw@mail.gmail.com/
> >
>
> My mistake! I sent the old version of the patch.
>
> The new version was modified to simply check the destructor via
> folio_test_hugetlb() in order to decide if it should be cleared.
>
> I will send V2. Sorry!
I was about to send v2 when I noticed that this approach opened another
race window. :( Closing the window should be just a matter of
reordering code. I will take a day or two to make sure I did not miss
something else.
--
Mike Kravetz
* [PATCH 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
2023-07-11 22:09 [PATCH 0/2] Fix hugetlb free path race with memory errors Mike Kravetz
2023-07-11 22:09 ` [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap Mike Kravetz
@ 2023-07-11 22:09 ` Mike Kravetz
2023-07-13 17:34 ` [PATCH 0/2] Fix hugetlb free path race with memory errors Andrew Morton
2 siblings, 0 replies; 7+ messages in thread
From: Mike Kravetz @ 2023-07-11 22:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Jiaqi Yan, Naoya Horiguchi, Muchun Song, Miaohe Lin,
Axel Rasmussen, James Houghton, Michal Hocko, Andrew Morton,
Mike Kravetz, stable
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The
hugetlb page destructor must be changed before pages are freed to lower
level allocators. However, the destructor must be changed under the
hugetlb lock. This means there is potentially one lock cycle per page.
Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating the necessary vmemmap for all hugetlb pages on the list,
2) taking the hugetlb lock once and clearing the destructor for all pages
   on the list, and
3) freeing all pages on the list back to the low level allocators.
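In outline (condensed from the diff below; the error path that re-adds a
page as a surplus page is omitted):

  /* pass 1: allocate vmemmap for all pages, no lock held */
  list_for_each_entry_safe(page, t_page, list, lru)
          if (HPageVmemmapOptimized(page))
                  hugetlb_vmemmap_restore(h, page);

  /* pass 2: a single lock cycle to clear all destructors */
  spin_lock_irq(&hugetlb_lock);
  list_for_each_entry(page, list, lru)
          __clear_hugetlb_destructor(h, page_folio(page));
  spin_unlock_irq(&hugetlb_lock);

  /* pass 3: free to the low level allocators, no lock needed */
  list_for_each_entry_safe(page, t_page, list, lru)
          update_and_free_hugetlb_folio(h, page_folio(page), false);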
Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1b67bf341c32..e751fced870a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1856,11 +1856,44 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
struct page *page, *t_page;
struct folio *folio;
+ bool clear_dtor = false;
+ /*
+ * First allocate required vmemmap for all pages on list. If vmemmap
+ * can not be allocated, we can not free page to lower level allocator,
+ * so add back as hugetlb surplus page.
+ */
+ list_for_each_entry_safe(page, t_page, list, lru) {
+ if (HPageVmemmapOptimized(page)) {
+ clear_dtor = true;
+ if (hugetlb_vmemmap_restore(h, page)) {
+ spin_lock_irq(&hugetlb_lock);
+ add_hugetlb_folio(h, folio, true);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+ cond_resched();
+ }
+ }
+
+ /*
+ * If vmemmap allocation performed above, then take lock to clear
+ * destructor of all pages on list.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(page, list, lru)
+ __clear_hugetlb_destructor(h, page_folio(page));
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
+ /*
+ * Free pages back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
list_for_each_entry_safe(page, t_page, list, lru) {
folio = page_folio(page);
update_and_free_hugetlb_folio(h, folio, false);
- cond_resched();
}
}
--
2.41.0
* Re: [PATCH 0/2] Fix hugetlb free path race with memory errors
2023-07-11 22:09 [PATCH 0/2] Fix hugetlb free path race with memory errors Mike Kravetz
2023-07-11 22:09 ` [PATCH 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap Mike Kravetz
2023-07-11 22:09 ` [PATCH 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Mike Kravetz
@ 2023-07-13 17:34 ` Andrew Morton
2 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2023-07-13 17:34 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Jiaqi Yan, Naoya Horiguchi, Muchun Song,
Miaohe Lin, Axel Rasmussen, James Houghton, Michal Hocko,
Greg Kroah-Hartman
On Tue, 11 Jul 2023 15:09:40 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
> HWPOISON hugepages" the race window was discovered.
> https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
>
> Freeing a hugetlb page back to low level memory allocators is performed
> in two steps.
> 1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
> 2) Outside lock, allocate vmemmap if necessary and call low level free
> Between these two steps, the hugetlb page will appear as a normal
> compound page. However, vmemmap for tail pages could be missing.
> If a memory error occurs at this time, we could try to update page
> flags in non-existent page structs.
>
> A much more detailed description is in the first patch.
>
> The first patch addresses the race window. However, it adds a
> hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb
> page free operation. This could lead to slowdowns if one is freeing
> a large number of hugetlb pages.
>
> The second patch optimizes the update_and_free_pages_bulk routine
> to only take the lock once in bulk operations.
>
> The second patch is technically not a bug fix, but includes a Fixes
> tag and Cc stable to avoid a performance regression. It can be
> combined with the first, but was done separately to make reviewing easier.
>
I feel that backporting performance improvements into -stable is not a
usual thing to do. Perhaps the fact that it's a regression fix changes
this, but why?
Much hinges on the magnitude of the performance change. Are you able
to quantify this at all?