All of lore.kernel.org
 help / color / mirror / Atom feed
From: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Michal Hocko <mhocko@suse.com>,
	Muchun Song <songmuchun@bytedance.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"osalvador@suse.de" <osalvador@suse.de>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
Date: Wed, 28 Apr 2021 16:46:54 +0900	[thread overview]
Message-ID: <20210428074654.GA2093897@u2004> (raw)
In-Reply-To: <20210423080153.GA78658@hori.linux.bs1.fc.nec.co.jp>

On Fri, Apr 23, 2021 at 08:01:54AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Apr 22, 2021 at 08:27:46AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> > > On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > > >> [Cc Naoya]
> > > >>
> > > >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > >>> The possible bad scenario:
> > > >>>
> > > >>> CPU0:                           CPU1:
> > > >>>
> > > >>>                                 gather_surplus_pages()
> > > >>>                                   page = alloc_surplus_huge_page()
> > > >>> memory_failure_hugetlb()
> > > >>>   get_hwpoison_page(page)
> > > >>>     __get_hwpoison_page(page)
> > > >>>       get_page_unless_zero(page)
> > > >>>                                   zero = put_page_testzero(page)
> > > >>>                                   VM_BUG_ON_PAGE(!zero, page)
> > > >>>                                   enqueue_huge_page(h, page)
> > > >>>   put_page(page)
> > > >>>
> > > >>> The refcount can possibly be increased by memory-failure or soft_offline
> > > >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > >>> hugetlb pool list.
> > > >>
> > > >> The hwpoison side of this looks really suspicious to me. It shouldn't
> > > >> really touch the reference count of hugetlb pages without being very
> > > >> careful (and having hugetlb_lock held).
> > > > 
> > > > I have the same feeling, there is a window where a hugepage is refcounted
> > > > during converting from buddy free pages into free hugepage, so refcount
> > > > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > > > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > > > get_hwpoison_page() seems not work.  Is there any status bit to show that a
> > > > hugepage is just being initialized (not in free hugepage pool or in use)?
> > > > 
> > > 
> > > It seems we can also race with the code that makes a compound page a
> > > hugetlb page.  The memory failure code could be called after allocating
> > > pages from buddy and before setting compound page DTOR.  So, the memory
> > > handling code will process it as a compound page.
> > 
> > Yes, so get_hwpoison_page() has to call get_page_unless_zero()
> > only when memory_failure() can surely handle the error.
> > 
> > > 
> > > Just thinking that this may not be limited to the hugetlb specific memory
> > > failure handling?
> > 
> > Currently hugetlb page is the only type of compound page supported by memory
> > failure.  But I agree with you that other types of compound pages have the
> > same race window, and judging only with get_page_unless_zero() is dangerous.
> > So I think that __get_hwpoison_page() should have the following structure:
> > 
> >   if (PageCompound) {
> >       if (PageHuge) {
> >           if (PageHugeFreed || PageHugeActive) {
> >               if (get_page_unless_zero)
> >                   return 0;   // path for in-use hugetlb page
> >               else
> >                   return 1;   // path for free hugetlb page
> >           } else {
> >               return -EBUSY;  // any transient hugetlb page
> >           }
> >       } else {
> >           ... // any other compound page (like thp, slab, ...)
> >       }
> >   } else {
> >       ...   // any non-compound page
> >   }
> 
> The above pseudo code was wrong, so let me update my thought.
> I'm now trying to solve the reported issue by changing __get_hwpoison_page()
> like below:
> 
>   static int __get_hwpoison_page(struct page *page)
>   {
>           struct page *head = compound_head(page);
>   
>           if (PageCompound(page)) {
>                   if (PageSlab(page)) {
>                           return get_page_unless_zero(page);
>                   } else if (PageHuge(head)) {
>                           if (HPageFreed(head) || HPageMigratable(head))
>                                   return get_page_unless_zero(head);
>                   } else if (PageTransHuge(head)) {
>                           /*
>                            * Non anonymous thp exists only in allocation/free time. We
>                            * can't handle such a case correctly, so let's give it up.
>                            * This should be better than triggering BUG_ON when kernel
>                            * tries to touch the "partially handled" page.
>                            */
>                           if (!PageAnon(head)) {
>                                   pr_err("Memory failure: %#lx: non anonymous thp\n",
>                                          page_to_pfn(page));
>                                   return 0;
>                           }
>                           if (get_page_unless_zero(head)) {
>                                   if (head == compound_head(page))
>                                           return 1;
>                                   pr_info("Memory failure: %#lx cannot catch tail\n",
>                                           page_to_pfn(page));
>                                   put_page(head);
>                           }
>                   }
>                   return 0;
>           }
>   
>           return get_page_unless_zero(page);
>   }
> 
> Some notes: 
> 
>   - in hugetlb path, new HPage* checks should avoid the reported race,
>     but I still need more testing to confirm it,
>   - PageSlab check is added because otherwise I found that "non anonymous thp"
>     path is chosen, that's obviously wrong,
>   - thp's branch has a known issue unrelated to the current issue, which
>     will/should be improved later.
> 
> I'll send a patch next week.

I confirmed that the patch fixes the reported problem (in the testcase
triggering VM_BUG_ON_PAGE() without this patch).
So let me suggest this as a fix on hwpoison side.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Wed, 28 Apr 2021 15:55:47 +0900
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation

When hugetlb page fault (under overcommiting situation) and memory_failure()
race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks page refcount before taking additional
one for memory error handling, which is wrong because there's time
windows where compound pages have non-zero refcount during initialization.

So makes __get_hwpoison_page() check more page status for a few types
of compound pages. PageSlab() check is added because otherwise
"non anonymous thp" path is wrongly chosen for slab pages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a3659619d293..61988e332712 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
+	if (PageCompound(page)) {
+		if (PageSlab(page)) {
+			return get_page_unless_zero(page);
+		} else if (PageHuge(head)) {
+			if (HPageFreed(head) || HPageMigratable(head))
+				return get_page_unless_zero(head);
+		} else if (PageTransHuge(head)) {
+			/*
+			 * Non anonymous thp exists only in allocation/free time. We
+			 * can't handle such a case correctly, so let's give it up.
+			 * This should be better than triggering BUG_ON when kernel
+			 * tries to touch the "partially handled" page.
+			 */
+			if (!PageAnon(head)) {
+				pr_err("Memory failure: %#lx: non anonymous thp\n",
+				       page_to_pfn(page));
+				return 0;
+			}
+			if (get_page_unless_zero(head)) {
+				if (head == compound_head(page))
+					return 1;
+				pr_info("Memory failure: %#lx cannot catch tail\n",
+					page_to_pfn(page));
+				put_page(head);
+			}
 		}
+		return 0;
 	}
 
-	if (get_page_unless_zero(head)) {
-		if (head == compound_head(page))
-			return 1;
-
-		pr_info("Memory failure: %#lx cannot catch tail\n",
-			page_to_pfn(page));
-		put_page(head);
-	}
-
-	return 0;
+	return get_page_unless_zero(page);
 }
 
 /*
-- 
2.25.1



  reply	other threads:[~2021-04-28  7:47 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-21  6:02 [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages Muchun Song
2021-04-21  8:03 ` Michal Hocko
2021-04-21  8:15   ` [External] " Muchun Song
2021-04-21  8:21     ` Oscar Salvador
2021-04-21  8:41       ` Muchun Song
2021-04-21  8:49         ` Oscar Salvador
2021-04-21  8:58           ` Muchun Song
2021-04-21  8:43       ` Michal Hocko
2021-04-21  8:25     ` Michal Hocko
2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口 直也)
2021-04-21  9:02     ` [External] " Muchun Song
2021-04-21 18:03     ` Mike Kravetz
2021-04-22  8:27       ` HORIGUCHI NAOYA(堀口 直也)
2021-04-23  8:01         ` HORIGUCHI NAOYA(堀口 直也)
2021-04-28  7:46           ` Naoya Horiguchi [this message]
2021-04-28  8:23             ` [PATCH] mm,hwpoison: fix race with compound page allocation Oscar Salvador
2021-04-28  9:18               ` HORIGUCHI NAOYA(堀口 直也)
2021-05-06  1:31                 ` [PATCH v2] " Naoya Horiguchi
2021-05-06  8:51                   ` Oscar Salvador
2021-05-07  4:17                     ` HORIGUCHI NAOYA(堀口 直也)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210428074654.GA2093897@u2004 \
    --to=nao.horiguchi@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=osalvador@suse.de \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.