public inbox for stable@vger.kernel.org
* [PATCH] thp: close race between split and zap huge pages
@ 2014-04-15 21:48 Kirill A. Shutemov
  2014-04-16 14:46 ` Sasha Levin
  2014-04-16 20:19 ` Andrew Morton
  0 siblings, 2 replies; 4+ messages in thread
From: Kirill A. Shutemov @ 2014-04-15 21:48 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Bob Liu, linux-mm, linux-kernel,
	Kirill A. Shutemov, stable

Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
the same root cause. Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
From my testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to a race between zap_huge_pmd() and
page_check_address_pmd(): page_check_address_pmd() misses the PMD
which is under zap:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding is that the problem is the pmd_present() check in
mm_find_pmd(), done without taking the page table lock.

The bug was introduced by my commit 117b0791ac42. Sorry for
that. :(

Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
check under the page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.

I've stress-tested the split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Without the patch it took <20
minutes to trigger the first bug and a few hours to trigger the second
one (if we ignore the first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: <stable@vger.kernel.org> #3.13+
---
 mm/huge_memory.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5025709bb3b5..d02a83852ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (address & ~HPAGE_PMD_MASK)
 		return NULL;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
+
 	*ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-- 
1.9.1



* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 [PATCH] thp: close race between split and zap huge pages Kirill A. Shutemov
@ 2014-04-16 14:46 ` Sasha Levin
  2014-04-16 20:19 ` Andrew Morton
  1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2014-04-16 14:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michel Lespinasse, Dave Jones,
	Vlastimil Babka, Bob Liu, linux-mm, linux-kernel, stable

On 04/15/2014 05:48 PM, Kirill A. Shutemov wrote:
> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
> 
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
> 
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(): page_check_address_pmd() misses the PMD
> which is under zap:
> 
> 	CPU0						CPU1
> 						zap_huge_pmd()
> 						  pmdp_get_and_clear()
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
> 	  /*
> 	   * We check if PMD present without taking ptl: no
> 	   * serialization against zap_huge_pmd(). We miss this PMD,
> 	   * it's not accounted to 'mapcount' in __split_huge_page().
> 	   */
> 	  pmd_present(pmd) == 0
> 
>   BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
> 
> 						  page_remove_rmap(page)
> 						    atomic_add_negative(-1, &page->_mapcount)
> 
> The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
> It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
> 
> This happens in a similar way:
> 
> 	CPU0						CPU1
> 						zap_huge_pmd()
> 						  pmdp_get_and_clear()
> 						  page_remove_rmap(page)
> 						    atomic_add_negative(-1, &page->_mapcount)
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
> 	  pmd_present(pmd) == 0	/* The same comment as above */
>   /*
>    * No crash this time since we already decremented page->_mapcount in
>    * zap_huge_pmd().
>    */
>   BUG_ON(mapcount != page_mapcount(page))
> 
>   /*
>    * We split the compound page here into small pages without
>    * serialization against zap_huge_pmd()
>    */
>   __split_huge_page_refcount()
> 						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
> 
> So my understanding is that the problem is the pmd_present() check in
> mm_find_pmd(), done without taking the page table lock.
> 
> The bug was introduced by my commit 117b0791ac42. Sorry for
> that. :(
> 
> Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
> check under the page table lock.
> 
> Note that __page_check_address() does the same for PTE entries
> if sync != 0.
> 
> I've stress-tested the split and zap code paths for 36+ hours by now and
> don't see crashes with the patch applied. Without the patch it took <20
> minutes to trigger the first bug and a few hours to trigger the second
> one (if we ignore the first).
> 
> [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
> [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: <stable@vger.kernel.org> #3.13+

Seems to work for me, thanks!


Thanks,
Sasha



* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 [PATCH] thp: close race between split and zap huge pages Kirill A. Shutemov
  2014-04-16 14:46 ` Sasha Levin
@ 2014-04-16 20:19 ` Andrew Morton
  2014-04-18 20:56   ` Kirill A. Shutemov
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2014-04-16 20:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Michel Lespinasse,
	Sasha Levin, Dave Jones, Vlastimil Babka, Bob Liu, linux-mm,
	linux-kernel, stable

On Wed, 16 Apr 2014 00:48:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
> 
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
> 
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(): page_check_address_pmd() misses the PMD
> which is under zap:

Why did this bug happen?

In other words, what earlier mistakes had we made which led to you
getting this locking wrong?  

Based on that knowledge, what can we do to reduce the likelihood of
such mistakes being made in the future?  (Hint: the answer to this
will involve making changes to this patch).

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
>  			      enum page_check_address_pmd_flag flag,
>  			      spinlock_t **ptl)
>  {
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  
>  	if (address & ~HPAGE_PMD_MASK)
>  		return NULL;
>  
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
>  		return NULL;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return NULL;
> +	pmd = pmd_offset(pud, address);
> +
>  	*ptl = pmd_lock(mm, pmd);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		goto unlock;
>  	if (pmd_page(*pmd) != page)
>  		goto unlock;

So how do other callers of mm_find_pmd() manage to avoid this race, or
are they all buggy?

Is mm_find_pmd() really so simple and obvious that we can afford to
leave it undocumented?



* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-16 20:19 ` Andrew Morton
@ 2014-04-18 20:56   ` Kirill A. Shutemov
  0 siblings, 0 replies; 4+ messages in thread
From: Kirill A. Shutemov @ 2014-04-18 20:56 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Rik van Riel
  Cc: Kirill A. Shutemov, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Bob Liu, linux-mm, linux-kernel,
	stable

Andrew Morton wrote:
> On Wed, 16 Apr 2014 00:48:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> > the same root cause. Let's look at them one by one.
> > 
> > The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> > It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> > From my testing I see that page_mapcount() is higher than mapcount here.
> > 
> > I think it happens due to a race between zap_huge_pmd() and
> > page_check_address_pmd(): page_check_address_pmd() misses the PMD
> > which is under zap:
> 
> Why did this bug happen?
> 
> In other words, what earlier mistakes had we made which led to you
> getting this locking wrong?

The locking model for a perfect rmap walk (one that doesn't miss any page
table entry) is not straightforward and doesn't seem to be documented
properly anywhere.

Actually, the same bug was introduced into page_check_address() when the
split PTE lock was introduced (see commit c0718806cf95 "mm: rmap with inner
ptlock") and was fixed later (see commit 479db0bf408e "mm: dirty page
tracking race fix").

> Based on that knowledge, what can we do to reduce the likelihood of
> such mistakes being made in the future?  (Hint: the answer to this
> will involve making changes to this patch).

A patch adding a local comment is below.

But we really need proper documentation of the rmap expectations from
somebody who understands it better than I do.
Rik? Andrea?

> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
> >  			      enum page_check_address_pmd_flag flag,
> >  			      spinlock_t **ptl)
> >  {
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> >  	pmd_t *pmd;
> >  
> >  	if (address & ~HPAGE_PMD_MASK)
> >  		return NULL;
> >  
> > -	pmd = mm_find_pmd(mm, address);
> > -	if (!pmd)
> > +	pgd = pgd_offset(mm, address);
> > +	if (!pgd_present(*pgd))
> >  		return NULL;
> > +	pud = pud_offset(pgd, address);
> > +	if (!pud_present(*pud))
> > +		return NULL;
> > +	pmd = pmd_offset(pud, address);
> > +
> >  	*ptl = pmd_lock(mm, pmd);
> > -	if (pmd_none(*pmd))
> > +	if (!pmd_present(*pmd))
> >  		goto unlock;
> >  	if (pmd_page(*pmd) != page)
> >  		goto unlock;
> 
> So how do other callers of mm_find_pmd() manage to avoid this race, or
> are they all buggy?

None of them is involved in an rmap walk over PMDs. In that case
speculative pmd checks without ptl are fine.

> Is mm_find_pmd() really so simple and obvious that we can afford to
> leave it undocumented?

I think it is. rmap is not.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d02a83852ee9..98165f222cef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1552,6 +1552,11 @@ pmd_t *page_check_address_pmd(struct page *page,
        pmd = pmd_offset(pud, address);
 
        *ptl = pmd_lock(mm, pmd);
+       /*
+        * Check if pmd present *only* with ptl taken. For perfect rmap walk we
+        * want to be serialized against zap_huge_pmd() and can't just
+        * speculatively skip non-present pmd without getting ptl.
+        */
        if (!pmd_present(*pmd))
                goto unlock;
        if (pmd_page(*pmd) != page)
-- 
 Kirill A. Shutemov

