Re: [RFC PATCH] mm/hugetlb_vmemmap: fix race with speculative PFN walkers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Muchun Song <muchun.song@linux.dev>
To: Yu Zhao <yuzhao@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, David Hildenbrand <david@redhat.com>,
	Frank van der Linden <fvdl@google.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: Re: [RFC PATCH] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
Date: Wed, 26 Jun 2024 10:37:59 +0800	[thread overview]
Message-ID: <7380dad0-829a-48b6-a69e-e3a02ded30de@linux.dev> (raw)
In-Reply-To: <20240621213717.1099079-1-yuzhao@google.com>



On 2024/6/22 05:37, Yu Zhao wrote:
> While investigating HVO for THPs [1], it turns out that speculative
> PFN walkers like compaction can race with vmemmap modificatioins,
> e.g.,
>
>    CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
>    -----------------------------    ------------------------------
>    Allocates an LRU folio page1
>                                     Sees page1
>    Frees page1
>
>    Allocates a hugeTLB folio page2
>    (page1 being a tail of page2)
>
>    Updates vmemmap mapping page1
>                                     get_page_unless_zero(page1)
>
> Even though page1 has a zero refcnt after HVO, get_page_unless_zero()
> can still try to modify its read-only struct page resulting in a
> crash.
>
> An independent report [2] confirmed this race.

Right. Thanks for your continuous focus on this race.

>
> There are two discussed approaches to fix this race:
> 1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
>     triggering a PF.
> 2. Use RCU to make sure get_page_unless_zero() either sees zero
>     refcnts through the old vmemmap or non-zero refcnts through the new
>     one.
>
> The second approach is preferred here because:
> 1. It can prevent illegal modifications to struct page[] that is HVO;
> 2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
>     similar races in other places, e.g., arch_remove_memory() on x86
>     [3], which frees vmemmap mapping offlined struct page[].
>
> While adding synchronize_rcu(), the goal is to be surgical, rather
> than optimized. Specifically, calls to synchronize_rcu() on the error
> handling paths can be coalesced, but it is not done for the sake of
> Simplicity: noticeably, this fix removes ~50% more lines than it adds.
>
> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
> [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>   include/linux/page_ref.h |  8 ++++++-
>   mm/hugetlb.c             | 50 +++++-----------------------------------
>   mm/hugetlb_vmemmap.c     | 16 +++++++++++++
>   3 files changed, 29 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 1acf5bac7f50..add92e8f31b2 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
>   
>   static inline bool page_ref_add_unless(struct page *page, int nr, int u)
>   {
> -	bool ret = atomic_add_unless(&page->_refcount, nr, u);
> +	bool ret = false;
> +
> +	rcu_read_lock();
> +	/* avoid writing to the vmemmap area being remapped */
> +	if (!page_is_fake_head(page) && page_ref_count(page) != u)
> +		ret = atomic_add_unless(&page->_refcount, nr, u);
> +	rcu_read_unlock();
>   
>   	if (page_ref_tracepoint_active(page_ref_mod_unless))
>   		__page_ref_mod_unless(page, nr, ret);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f35abff8be60..271d83a7cde0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1629,9 +1629,8 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
>    *
>    * Must be called with hugetlb lock held.
>    */
> -static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> -							bool adjust_surplus,
> -							bool demote)
> +static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> +							bool adjust_surplus)
>   {
>   	int nid = folio_nid(folio);
>   
> @@ -1661,33 +1660,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
>   	if (!folio_test_hugetlb_vmemmap_optimized(folio))
>   		__folio_clear_hugetlb(folio);
>   
> -	 /*
> -	  * In the case of demote we do not ref count the page as it will soon
> -	  * be turned into a page of smaller size.
> -	 */
> -	if (!demote)
> -		folio_ref_unfreeze(folio, 1);
> -
>   	h->nr_huge_pages--;
>   	h->nr_huge_pages_node[nid]--;
>   }
>   
> -static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> -							bool adjust_surplus)
> -{
> -	__remove_hugetlb_folio(h, folio, adjust_surplus, false);
> -}
> -
> -static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
> -							bool adjust_surplus)
> -{
> -	__remove_hugetlb_folio(h, folio, adjust_surplus, true);
> -}
> -
>   static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   			     bool adjust_surplus)
>   {
> -	int zeroed;
>   	int nid = folio_nid(folio);
>   
>   	VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
> @@ -1711,21 +1690,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   	 */
>   	folio_set_hugetlb_vmemmap_optimized(folio);
>   
> -	/*
> -	 * This folio is about to be managed by the hugetlb allocator and
> -	 * should have no users.  Drop our reference, and check for others
> -	 * just in case.
> -	 */
> -	zeroed = folio_put_testzero(folio);
> -	if (unlikely(!zeroed))
> -		/*
> -		 * It is VERY unlikely soneone else has taken a ref
> -		 * on the folio.  In this case, we simply return as
> -		 * free_huge_folio() will be called when this other ref
> -		 * is dropped.
> -		 */
> -		return;
> -
>   	arch_clear_hugetlb_flags(folio);
>   	enqueue_hugetlb_folio(h, folio);
>   }
> @@ -1779,6 +1743,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>   		spin_unlock_irq(&hugetlb_lock);
>   	}
>   
> +	folio_ref_unfreeze(folio, 1);
> +
>   	/*
>   	 * Non-gigantic pages demoted from CMA allocated gigantic pages
>   	 * need to be given back to CMA in free_gigantic_folio.
> @@ -3079,11 +3045,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
>   
>   free_new:
>   	spin_unlock_irq(&hugetlb_lock);
> -	if (new_folio) {
> -		/* Folio has a zero ref count, but needs a ref to be freed */
> -		folio_ref_unfreeze(new_folio, 1);
> +	if (new_folio)
>   		update_and_free_hugetlb_folio(h, new_folio, false);
> -	}

Look into this function, we have:

dissolve_free_huge_page
retry:
     if (!folio_test_hugetlb(folio))
         return;
     if (!folio_ref_count(folio))
         if (unlikely(!folio_test_hugetlb_freed(folio)))
             goto retry;
     remove_hugetlb_folio(h, folio, false);

Since you have not raised the refcount in remove_hugetlb_folio(), we will
disslove this page again if there is a concurrent dissolve_free_huge_page()
processing routine. Then, the statistics will be wrong (like 
->nr_huge_pages).
A solution seems easy, we should clear folio_clear_hugetlb_freed in
remove_hugetlb_folio.

Muchun,
Thanks.
>   
>   	return ret;
>   }
> @@ -3938,7 +3901,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
>   
>   	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
>   
> -	remove_hugetlb_folio_for_demote(h, folio, false);
> +	remove_hugetlb_folio(h, folio, false);
>   	spin_unlock_irq(&hugetlb_lock);
>   
>   	/*
> @@ -3952,7 +3915,6 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
>   		if (rc) {
>   			/* Allocation of vmemmmap failed, we can not demote folio */
>   			spin_lock_irq(&hugetlb_lock);
> -			folio_ref_unfreeze(folio, 1);
>   			add_hugetlb_folio(h, folio, false);
>   			return rc;
>   		}
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index b9a55322e52c..8193906515c6 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -446,6 +446,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>   	unsigned long vmemmap_reuse;
>   
>   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
> +	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> +
>   	if (!folio_test_hugetlb_vmemmap_optimized(folio))
>   		return 0;
>   
> @@ -481,6 +483,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>    */
>   int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
>   {
> +	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
> +	synchronize_rcu();
> +
>   	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
>   }
>   
> @@ -505,6 +510,9 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
>   	long restored = 0;
>   	long ret = 0;
>   
> +	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
> +	synchronize_rcu();
> +
>   	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
>   		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>   			ret = __hugetlb_vmemmap_restore_folio(h, folio,
> @@ -550,6 +558,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
>   	unsigned long vmemmap_reuse;
>   
>   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
> +	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> +
>   	if (!vmemmap_should_optimize_folio(h, folio))
>   		return ret;
>   
> @@ -601,6 +611,9 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
>   {
>   	LIST_HEAD(vmemmap_pages);
>   
> +	/* avoid writes from page_ref_add_unless() while folding vmemmap */
> +	synchronize_rcu();
> +
>   	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
>   	free_vmemmap_page_list(&vmemmap_pages);
>   }
> @@ -644,6 +657,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
>   
>   	flush_tlb_all();
>   
> +	/* avoid writes from page_ref_add_unless() while folding vmemmap */
> +	synchronize_rcu();
> +
>   	list_for_each_entry(folio, folio_list, lru) {
>   		int ret;
>   
>
> base-commit: 264efe488fd82cf3145a3dc625f394c61db99934
> prerequisite-patch-id: 5029fb66d9bf40b84903a5b4f066e85101169e84
> prerequisite-patch-id: 7889e5ee16b8e91cccde12468f1d2c3f65500336
> prerequisite-patch-id: 0d4c19afc7b92f16bee9e9cf2b6832406389742a
> prerequisite-patch-id: c56f06d4bb3e738aea489ec30313ed0c1dbac325

next prev parent reply	other threads:[~2024-06-26  2:38 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-21 21:37 [RFC PATCH] mm/hugetlb_vmemmap: fix race with speculative PFN walkers Yu Zhao
2024-06-26  2:37 ` Muchun Song [this message]
2024-06-27  4:47   ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7380dad0-829a-48b6-a69e-e3a02ded30de@linux.dev \
    --to=muchun.song@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=fvdl@google.com \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.