exit_mmap() BUG_ON triggering since 3.1


* exit_mmap() BUG_ON triggering since 3.1
@ 2012-02-15 18:33 Dave Jones
  2012-02-16  2:14 ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Jones @ 2012-02-15 18:33 UTC (permalink / raw)
  To: Linux Kernel; +Cc: linux-mm, Fedora Kernel Team

We've had three reports against the Fedora kernel recently where
a process exits, and we're tripping up the 

        BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);

in exit_mmap()
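
To make the check concrete, here is a minimal sketch in plain userspace C of
what the right-hand side of that BUG_ON evaluates to. The constants are
assumptions chosen for illustration (a 2 MiB PMD, as is common on x86_64), and
the helper names are hypothetical, not kernel APIs. With FIRST_USER_ADDRESS
equal to 0 the allowed leftover is 0, so any page-table page still counted at
exit trips the check.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed for illustration: a 2 MiB PMD (PMD_SHIFT = 21). */
#define MODEL_PMD_SHIFT 21
#define MODEL_PMD_SIZE  (1UL << MODEL_PMD_SHIFT)

/*
 * Page-table pages allowed to remain at exit: first_user_address rounded up
 * to whole PMD-sized units, mirroring the expression in the BUG_ON.
 */
static unsigned long allowed_leftover_ptes(unsigned long first_user_address)
{
	/* Guard against wrap-around in the rounding addition. */
	assert(first_user_address <= ULONG_MAX - (MODEL_PMD_SIZE - 1));
	return (first_user_address + MODEL_PMD_SIZE - 1) >> MODEL_PMD_SHIFT;
}

/* True when a check of this shape would fire for the given counter value. */
static bool exit_check_would_fire(unsigned long nr_ptes,
				  unsigned long first_user_address)
{
	return nr_ptes > allowed_leftover_ptes(first_user_address);
}
```

Under these assumptions, exit_check_would_fire(1, 0) is true, while a
hypothetical nonzero first user address of 4096 would tolerate one leftover
page-table page.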

It started happening with 3.1, but still occurs on 3.2
(no 3.3rc reports yet, but it's not getting much testing).

https://bugzilla.redhat.com/show_bug.cgi?id=786632
https://bugzilla.redhat.com/show_bug.cgi?id=787527
https://bugzilla.redhat.com/show_bug.cgi?id=790546

I don't see anything special in common between the loaded modules.

anyone?

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-15 18:33 exit_mmap() BUG_ON triggering since 3.1 Dave Jones
@ 2012-02-16  2:14 ` Hugh Dickins
  2012-02-16  2:22   ` Roland Dreier
  2012-02-16  7:07   ` Andrea Arcangeli
  0 siblings, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2012-02-16  2:14 UTC (permalink / raw)
  To: Dave Jones
  Cc: Andrew Morton, Andrea Arcangeli, David Rientjes, linux-kernel,
	linux-mm, kernel-team

On Wed, 15 Feb 2012, Dave Jones wrote:

> We've had three reports against the Fedora kernel recently where
> a process exits, and we're tripping up the 
> 
>         BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
> 
> in exit_mmap()
> 
> It started happening with 3.1, but still occurs on 3.2
> (no 3.3rc reports yet, but it's not getting much testing).
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=786632
> https://bugzilla.redhat.com/show_bug.cgi?id=787527
> https://bugzilla.redhat.com/show_bug.cgi?id=790546
> 
> I don't see anything special in common between the loaded modules.
> 
> anyone?

My suspicion was that it would be related to Transparent HugePages:
they do complicate the pagetable story.  And I think I have found a
potential culprit.  I don't know if nr_ptes is the only loser from
these two split_huge_pages calls, but assuming it is...


[PATCH] mm: fix BUG on mm->nr_ptes

mm->nr_ptes had unusual locking: down_read mmap_sem plus page_table_lock
when incrementing, down_write mmap_sem (or mm_users 0) when decrementing;
whereas THP is careful to increment and decrement it under page_table_lock.

Now most of those paths in THP also hold mmap_sem for read or write (with
appropriate checks on mm_users), but two do not: when split_huge_page()
is called by hwpoison_user_mappings(), and when called by add_to_swap().

It's conceivable that the latter case is responsible for the exit_mmap()
BUG_ON mm->nr_ptes that has been reported on Fedora.

THP's understanding of the locking seems reasonable, so take that lock
to update it in free_pgd_range(): try to avoid retaking it repeatedly
by passing the count up from levels below - free_pgtables() already
does its best to combine calls across neighbouring vmas.

Or should we try harder to avoid the extra locking: test mm_users?
#ifdef on THP?  Or consider the accuracy of this count not worth
extra locking, and just scrap the BUG_ON now?

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/memory.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

--- 3.3-rc3/mm/memory.c	2012-01-31 14:51:15.100021868 -0800
+++ linux/mm/memory.c	2012-02-15 17:01:46.588649490 -0800
@@ -419,22 +419,23 @@ void pmd_clear_bad(pmd_t *pmd)
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static long free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 			   unsigned long addr)
 {
 	pgtable_t token = pmd_pgtable(*pmd);
 	pmd_clear(pmd);
 	pte_free_tlb(tlb, token, addr);
-	tlb->mm->nr_ptes--;
+	return 1;
 }
 
-static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+static inline long free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long start;
+	long nr_ptes = 0;
 
 	start = addr;
 	pmd = pmd_offset(pud, addr);
@@ -442,32 +443,35 @@ static inline void free_pmd_range(struct
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		free_pte_range(tlb, pmd, addr);
+		nr_ptes += free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
 	if (start < floor)
-		return;
+		goto out;
 	if (ceiling) {
 		ceiling &= PUD_MASK;
 		if (!ceiling)
-			return;
+			goto out;
 	}
 	if (end - 1 > ceiling - 1)
-		return;
+		goto out;
 
 	pmd = pmd_offset(pud, start);
 	pud_clear(pud);
 	pmd_free_tlb(tlb, pmd, start);
+out:
+	return nr_ptes;
 }
 
-static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline long free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
+	long nr_ptes = 0;
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
@@ -475,23 +479,25 @@ static inline void free_pud_range(struct
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+		nr_ptes += free_pmd_range(tlb, pud, addr, next, floor, ceiling);
 	} while (pud++, addr = next, addr != end);
 
 	start &= PGDIR_MASK;
 	if (start < floor)
-		return;
+		goto out;
 	if (ceiling) {
 		ceiling &= PGDIR_MASK;
 		if (!ceiling)
-			return;
+			goto out;
 	}
 	if (end - 1 > ceiling - 1)
-		return;
+		goto out;
 
 	pud = pud_offset(pgd, start);
 	pgd_clear(pgd);
 	pud_free_tlb(tlb, pud, start);
+out:
+	return nr_ptes;
 }
 
 /*
@@ -505,6 +511,7 @@ void free_pgd_range(struct mmu_gather *t
 {
 	pgd_t *pgd;
 	unsigned long next;
+	long nr_ptes = 0;
 
 	/*
 	 * The next few lines have given us lots of grief...
@@ -553,8 +560,15 @@ void free_pgd_range(struct mmu_gather *t
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+		nr_ptes += free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 	} while (pgd++, addr = next, addr != end);
+
+	if (nr_ptes) {
+		struct mm_struct *mm = tlb->mm;
+		spin_lock(&mm->page_table_lock);
+		mm->nr_ptes -= nr_ptes;
+		spin_unlock(&mm->page_table_lock);
+	}
 }
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
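
The shape of the patch above (lower levels return how many page tables they
freed, and the top level applies the total under the lock once) can be
sketched in userspace C. This is a toy model under stated assumptions: the
struct, the fake lock and every model_* name are hypothetical stand-ins, not
the kernel's types or functions.

```c
#include <assert.h>

/* Toy stand-in for the parts of mm_struct that matter here. */
struct mm_model {
	long nr_ptes;               /* page-table pages currently accounted */
	int lock_held;              /* models page_table_lock being held */
	unsigned lock_acquisitions; /* how many times the lock was taken */
};

static void model_lock(struct mm_model *mm)
{
	assert(!mm->lock_held);
	mm->lock_held = 1;
	mm->lock_acquisitions++;
}

static void model_unlock(struct mm_model *mm)
{
	assert(mm->lock_held);
	mm->lock_held = 0;
}

/* Lowest level: free one page table and report it, without touching nr_ptes. */
static long model_free_pte_range(void)
{
	return 1;
}

/* Middle level: sum what the level below freed; populated[i] marks a present entry. */
static long model_free_pmd_range(const int *populated, int n)
{
	long nr_ptes = 0;

	for (int i = 0; i < n; i++) {
		if (!populated[i])
			continue; /* empty entry: nothing to free */
		nr_ptes += model_free_pte_range();
	}
	return nr_ptes;
}

/* Top level: apply the accumulated count under the lock, at most once per call. */
static void model_free_pgd_range(struct mm_model *mm, const int *populated, int n)
{
	long nr_ptes = model_free_pmd_range(populated, n);

	if (nr_ptes) {
		model_lock(mm);
		mm->nr_ptes -= nr_ptes;
		model_unlock(mm);
	}
}
```

In this model, freeing a range with three populated entries takes the lock once
rather than three times, and a range with nothing to free does not take it at
all.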


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16  2:14 ` Hugh Dickins
@ 2012-02-16  2:22   ` Roland Dreier
  2012-02-16  2:48     ` Hugh Dickins
  2012-02-16  7:07   ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Roland Dreier @ 2012-02-16  2:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Andrew Morton, Andrea Arcangeli, David Rientjes,
	linux-kernel, linux-mm, kernel-team

On Wed, Feb 15, 2012 at 6:14 PM, Hugh Dickins <hughd@google.com> wrote:
> My suspicion was that it would be related to Transparent HugePages:
> they do complicate the pagetable story.  And I think I have found a
> potential culprit.  I don't know if nr_ptes is the only loser from
> these two split_huge_pages calls, but assuming it is...

Do you have an idea when this bug might have been introduced?
Presumably it's been there since THP came in?

The reason I ask is that I have one of these exit_mm BUG_ONs
in my pile of one-off unreproducible crashes, but in my case it
happened with 2.6.39 (with THP enabled).  So I'm wondering if
I can cross it off my list and blame this bug, or if it remains one
of those inexplicable mysteries...

Thanks,
  Roland


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16  2:22   ` Roland Dreier
@ 2012-02-16  2:48     ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2012-02-16  2:48 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dave Jones, Andrew Morton, Andrea Arcangeli, David Rientjes,
	linux-kernel, linux-mm, kernel-team

On Wed, 15 Feb 2012, Roland Dreier wrote:
> On Wed, Feb 15, 2012 at 6:14 PM, Hugh Dickins <hughd@google.com> wrote:
> > My suspicion was that it would be related to Transparent HugePages:
> > they do complicate the pagetable story.  And I think I have found a
> > potential culprit.  I don't know if nr_ptes is the only loser from
> > these two split_huge_pages calls, but assuming it is...
> 
> Do you have an idea when this bug might have been introduced?
> Presumably it's been there since THP came in?

That's right, since THP came in (2.6.38 on mainline,
but IIRC Red Hat had THP applied to an earlier kernel).

> 
> The reason I ask is that I have one of these exit_mm BUG_ONs
> in my pile of one-off unreproducible crashes, but in my case it
> happened with 2.6.39 (with THP enabled).  So I'm wondering if
> I can cross it off my list and blame this bug, or if it remains one
> of those inexplicable mysteries...

If you think that system could have been using swap, yes, cross it
off (unless someone points out that I'm totally wrong, because....).

But if you know that system used no swap (and didn't get involved
in any memory-failure hwpoison business), then keep on worrying!

Hugh


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16  2:14 ` Hugh Dickins
  2012-02-16  2:22   ` Roland Dreier
@ 2012-02-16  7:07   ` Andrea Arcangeli
  2012-02-16  9:53     ` Hugh Dickins
  1 sibling, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2012-02-16  7:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Andrew Morton, David Rientjes, linux-kernel, linux-mm,
	kernel-team

On Wed, Feb 15, 2012 at 06:14:12PM -0800, Hugh Dickins wrote:
> Now most of those paths in THP also hold mmap_sem for read or write (with
> appropriate checks on mm_users), but two do not: when split_huge_page()
> is called by hwpoison_user_mappings(), and when called by add_to_swap().

So the race is split_huge_page_map() called by add_to_swap() running
concurrently with free_pgtables. Great catch!!
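
One way to picture that race is to replay a single bad interleaving by hand. The
sketch below is a deterministic, single-threaded userspace model, not kernel
code: it assumes the counter update is a plain load followed by a store, and
all names in it are hypothetical. Because the two paths do not share a lock,
both may load the old value before either stores, and one update is lost.

```c
#include <assert.h>

/* One non-atomic "counter += delta", split into its load and store halves. */
struct rmw {
	long loaded; /* value this path read from the shared counter */
	long delta;  /* +1 for a page table gained, -1 for one freed */
};

static void rmw_load(struct rmw *op, const long *counter, long delta)
{
	assert(delta == 1 || delta == -1);
	op->loaded = *counter;
	op->delta = delta;
}

static void rmw_store(const struct rmw *op, long *counter)
{
	*counter = op->loaded + op->delta;
}

/*
 * Replay the bad interleaving: the split path (increment) and the free path
 * (decrement) both load before either stores. Returns the final counter.
 */
static long replay_lost_update(long start)
{
	long nr_ptes = start;
	struct rmw split, unmap;

	rmw_load(&split, &nr_ptes, +1); /* split path, holding only the spinlock */
	rmw_load(&unmap, &nr_ptes, -1); /* free path, holding only the semaphore */
	rmw_store(&split, &nr_ptes);    /* counter is now start + 1 */
	rmw_store(&unmap, &nr_ptes);    /* counter is now start - 1: increment lost */
	return nr_ptes;
}

/* The same two updates applied back to back, as a shared lock would force. */
static long replay_serialized(long start)
{
	long nr_ptes = start;
	struct rmw op;

	rmw_load(&op, &nr_ptes, +1);
	rmw_store(&op, &nr_ptes);
	rmw_load(&op, &nr_ptes, -1);
	rmw_store(&op, &nr_ptes);
	return nr_ptes;
}
```

Starting from 5, the serialized replay ends at 5 while the interleaved replay
ends at 4. A counter that has drifted like this no longer matches the number of
page tables actually freed by the time the exit-time check runs.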

> Or should we try harder to avoid the extra locking: test mm_users?
> #ifdef on THP?  Or consider the accuracy of this count not worth
> extra locking, and just scrap the BUG_ON now?

It's probably also happening with a large munmap, while add_to_swap
runs on another vma. Process didn't exit yet, but the actual BUG_ON
check runs at exit. So I doubt aborting split_huge_page on zero
mm_users could solve it.

The good part is that, being false positives, these oopses are only a
nuisance: they can't corrupt any memory or disk etc...

The simplest is probably to change nr_ptes to count THPs too. I tried
that and no oopses so far but it's not very well tested yet.
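
The difference between the two accounting schemes can be sketched with a toy
counter model in userspace C. Everything here is an assumption made for
illustration (the struct, the count_prealloc switch and the model_* names are
hypothetical). The point it shows is that once the preallocated page table is
counted when the huge page is mapped, a later split leaves the counter
untouched.

```c
#include <assert.h>

/* Toy model of one mm with huge mappings. */
struct thp_model {
	long nr_ptes;       /* the counter under discussion */
	long huge_mapped;   /* PMDs currently mapping a huge page */
	int count_prealloc; /* 0: old accounting, 1: proposed accounting */
};

/* Map a huge page; a page table is preallocated and set aside either way. */
static void model_map_huge(struct thp_model *mm)
{
	mm->huge_mapped++;
	if (mm->count_prealloc)
		mm->nr_ptes++; /* proposed: count the set-aside page table now */
}

/* Split a huge mapping; the set-aside page table becomes the live one. */
static void model_split_huge(struct thp_model *mm)
{
	assert(mm->huge_mapped > 0);
	mm->huge_mapped--;
	if (!mm->count_prealloc)
		mm->nr_ptes++; /* old: the counter changes here, in the split path */
}

/* Zap a still-huge mapping; the set-aside page table is freed with it. */
static void model_zap_huge(struct thp_model *mm)
{
	assert(mm->huge_mapped > 0);
	mm->huge_mapped--;
	if (mm->count_prealloc)
		mm->nr_ptes--; /* proposed: uncount what was counted at map time */
}
```

In both schemes the counter agrees after a split, since the split-off page
table is then freed and uncounted through the normal page-table teardown. Only
the moment at which it is first counted moves, away from the split path.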

====
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes

Quoting Hugh's discovery and explanation of the SMP race condition:

===
mm->nr_ptes had unusual locking: down_read mmap_sem plus
page_table_lock when incrementing, down_write mmap_sem (or mm_users 0)
when decrementing; whereas THP is careful to increment and decrement
it under page_table_lock.

Now most of those paths in THP also hold mmap_sem for read or write
(with appropriate checks on mm_users), but two do not: when
split_huge_page() is called by hwpoison_user_mappings(), and when
called by add_to_swap().

It's conceivable that the latter case is responsible for the
exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora.
===

The simplest way to fix it without having to alter the locking is to
make split_huge_page() a noop in nr_ptes terms, by counting the
preallocated pagetables that exist for every mapped hugepage. It was
an arbitrary choice not to count them, and neither way is wrong or
right, because they are not used but they're still allocated.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91d3efb..8f7fc39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,6 +671,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		set_pmd_at(mm, haddr, pmd, entry);
 		prepare_pmd_huge_pte(pgtable, mm);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		mm->nr_ptes++;
 		spin_unlock(&mm->page_table_lock);
 	}

@@ -789,6 +790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 	prepare_pmd_huge_pte(pgtable, dst_mm);
+	dst_mm->nr_ptes++;

 	ret = 0;
 out_unlock:
@@ -887,7 +889,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	}
 	kfree(pages);

-	mm->nr_ptes++;
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 	page_remove_rmap(page);
@@ -1047,6 +1048,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			VM_BUG_ON(page_mapcount(page) < 0);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
+			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
 			tlb_remove_page(tlb, page);
 			pte_free(tlb->mm, pgtable);
@@ -1375,7 +1377,6 @@ static int __split_huge_page_map(struct page *page,
 			pte_unmap(pte);
 		}

-		mm->nr_ptes++;
 		smp_wmb(); /* make pte visible before pmd */
 		/*
 		 * Up to this point the pmd is present and huge and
@@ -1988,7 +1989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache(vma, address, _pmd);
 	prepare_pmd_huge_pte(pgtable, mm);
-	mm->nr_ptes--;
 	spin_unlock(&mm->page_table_lock);

 #ifndef CONFIG_NUMA


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16  7:07   ` Andrea Arcangeli
@ 2012-02-16  9:53     ` Hugh Dickins
  2012-02-16 21:42       ` Andrea Arcangeli
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2012-02-16  9:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dave Jones, Andrew Morton, David Rientjes, linux-kernel, linux-mm,
	kernel-team

On Thu, 16 Feb 2012, Andrea Arcangeli wrote:
> On Wed, Feb 15, 2012 at 06:14:12PM -0800, Hugh Dickins wrote:
> > Now most of those paths in THP also hold mmap_sem for read or write (with
> > appropriate checks on mm_users), but two do not: when split_huge_page()
> > is called by hwpoison_user_mappings(), and when called by add_to_swap().
> 
> So the race is split_huge_page_map() called by add_to_swap() running
> concurrently with free_pgtables. Great catch!!
> 
> > Or should we try harder to avoid the extra locking: test mm_users?
> > #ifdef on THP?  Or consider the accuracy of this count not worth
> > extra locking, and just scrap the BUG_ON now?
> 
> It's probably also happening with a large munmap, while add_to_swap
> runs on another vma. Process didn't exit yet, but the actual BUG_ON
> check runs at exit. So I doubt aborting split_huge_page on zero
> mm_users could solve it.

Indeed, what I meant was, I was wondering whether to make the
spin_lock and spin_unlock in my patch conditional on mm_users,
not to make split_huge_page conditional on it.

> 
> The good part is that, being false positives, these oopses are only a
> nuisance: they can't corrupt any memory or disk etc...

Yes (and I think less troublesome than most BUGs, coming at exit
while not holding locks; though we could well make it a WARN_ON,
I don't think that existed back in the day).

> 
> The simplest is probably to change nr_ptes to count THPs too. I tried
> that and no oopses so far but it's not very well tested yet.

Oh, I like that, that's a much nicer fix than mine.  If you're happy
to change the THP end (which I could hardly blame for getting those odd
rules slightly wrong), and it passes your testing, then certainly add my

Acked-by: Hugh Dickins <hughd@google.com>

In looking into the bug, it had actually bothered me a little that you
were setting aside those pages, yet not counting them into nr_ptes;
though the only thing that cares is oom_kill.c, and the count of pages
in each hugepage can only dwarf the count in nr_ptes (whereas, without
hugepages, it's possible to populate very sparsely and nr_ptes become
significant).
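
For a sense of scale, the arithmetic can be written out as a small C helper.
The sizes are assumptions (4 KiB base pages and 2 MiB huge pages, a common
x86_64 configuration) and the function name is hypothetical.

```c
#include <assert.h>

/* Base pages covered by one huge page; both arguments are log2 of the sizes. */
static unsigned long base_pages_per_huge_page(unsigned int huge_shift,
					      unsigned int page_shift)
{
	assert(huge_shift >= page_shift);
	return 1UL << (huge_shift - page_shift);
}
```

With shifts of 21 and 12 this gives 512 base pages per huge page, against the
single preallocated page-table page set aside for it.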

> 
> ====
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
> 
> Quoting Hugh's discovery and explanation of the SMP race condition:
> 
> ===
> mm->nr_ptes had unusual locking: down_read mmap_sem plus
> page_table_lock when incrementing, down_write mmap_sem (or mm_users 0)
> when decrementing; whereas THP is careful to increment and decrement
> it under page_table_lock.
> 
> Now most of those paths in THP also hold mmap_sem for read or write
> (with appropriate checks on mm_users), but two do not: when
> split_huge_page() is called by hwpoison_user_mappings(), and when
> called by add_to_swap().
> 
> It's conceivable that the latter case is responsible for the
> exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora.
> ===
> 
> The simplest way to fix it without having to alter the locking is to
> make split_huge_page() a noop in nr_ptes terms, by counting the
> preallocated pagetables that exist for every mapped hugepage. It was
> an arbitrary choice not to count them, and neither way is wrong or
> right, because they are not used but they're still allocated.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/huge_memory.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 91d3efb..8f7fc39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -671,6 +671,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  		set_pmd_at(mm, haddr, pmd, entry);
>  		prepare_pmd_huge_pte(pgtable, mm);
>  		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		mm->nr_ptes++;
>  		spin_unlock(&mm->page_table_lock);
>  	}
>  
> @@ -789,6 +790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pmd = pmd_mkold(pmd_wrprotect(pmd));
>  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>  	prepare_pmd_huge_pte(pgtable, dst_mm);
> +	dst_mm->nr_ptes++;
>  
>  	ret = 0;
>  out_unlock:
> @@ -887,7 +889,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	}
>  	kfree(pages);
>  
> -	mm->nr_ptes++;
>  	smp_wmb(); /* make pte visible before pmd */
>  	pmd_populate(mm, pmd, pgtable);
>  	page_remove_rmap(page);
> @@ -1047,6 +1048,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			VM_BUG_ON(page_mapcount(page) < 0);
>  			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>  			VM_BUG_ON(!PageHead(page));
> +			tlb->mm->nr_ptes--;
>  			spin_unlock(&tlb->mm->page_table_lock);
>  			tlb_remove_page(tlb, page);
>  			pte_free(tlb->mm, pgtable);
> @@ -1375,7 +1377,6 @@ static int __split_huge_page_map(struct page *page,
>  			pte_unmap(pte);
>  		}
>  
> -		mm->nr_ptes++;
>  		smp_wmb(); /* make pte visible before pmd */
>  		/*
>  		 * Up to this point the pmd is present and huge and
> @@ -1988,7 +1989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	set_pmd_at(mm, address, pmd, _pmd);
>  	update_mmu_cache(vma, address, _pmd);
>  	prepare_pmd_huge_pte(pgtable, mm);
> -	mm->nr_ptes--;
>  	spin_unlock(&mm->page_table_lock);
>  
>  #ifndef CONFIG_NUMA


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16  9:53     ` Hugh Dickins
@ 2012-02-16 21:42       ` Andrea Arcangeli
  2012-03-02 22:53         ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2012-02-16 21:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Andrew Morton, David Rientjes, linux-kernel, linux-mm,
	kernel-team

On Thu, Feb 16, 2012 at 01:53:04AM -0800, Hugh Dickins wrote:
> Yes (and I think less troublesome than most BUGs, coming at exit
> while not holding locks; though we could well make it a WARN_ON,
> I don't think that existed back in the day).

A WARN_ON would be fine with me, go ahead if you prefer it... only
risk would be to go unnoticed or be underestimated. I am ok with the
BUG_ON too (even if this time it triggered false positives... sigh).

> Acked-by: Hugh Dickins <hughd@google.com>

Thanks for the quick review!

> In looking into the bug, it had actually bothered me a little that you
> were setting aside those pages, yet not counting them into nr_ptes;
> though the only thing that cares is oom_kill.c, and the count of pages
> in each hugepage can only dwarf the count in nr_ptes (whereas, without
> hugepages, it's possible to populate very sparsely and nr_ptes become
> significant).

Agreed, it's not significant either way.

Running my two primary systems with this applied for half a day and no
problem so far so it should be good for -mm at least.


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-02-16 21:42       ` Andrea Arcangeli
@ 2012-03-02 22:53         ` Hugh Dickins
  2012-03-02 22:58           ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2012-03-02 22:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Dave Jones, David Rientjes, linux-kernel,
	linux-mm, kernel-team

On Thu, 16 Feb 2012, Andrea Arcangeli wrote:
> On Thu, Feb 16, 2012 at 01:53:04AM -0800, Hugh Dickins wrote:
> > Yes (and I think less troublesome than most BUGs, coming at exit
> > while not holding locks; though we could well make it a WARN_ON,
> > I don't think that existed back in the day).
> 
> A WARN_ON would be fine with me, go ahead if you prefer it... only
> risk would be to go unnoticed or be underestimated. I am ok with the
> BUG_ON too (even if this time it triggered false positives... sigh).
> 
> > Acked-by: Hugh Dickins <hughd@google.com>
> 
> Thanks for the quick review!
> 
> > In looking into the bug, it had actually bothered me a little that you
> > were setting aside those pages, yet not counting them into nr_ptes;
> > though the only thing that cares is oom_kill.c, and the count of pages
> > in each hugepage can only dwarf the count in nr_ptes (whereas, without
> > hugepages, it's possible to populate very sparsely and nr_ptes become
> > significant).
> 
> Agreed, it's not significant either way.
> 
> Running my two primary systems with this applied for half a day and no
> problem so far so it should be good for -mm at least.

And I've had no trouble running your patch since then (but I never hit
the bug it fixes either).  But we've all forgotten about it, so let me
bring your patch back inline (I've added one introductory sentence) and
address to akpm...


From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes

Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
in exit_mmap() recently.

Quoting Hugh's discovery and explanation of the SMP race condition:

===
mm->nr_ptes had unusual locking: down_read mmap_sem plus
page_table_lock when incrementing, down_write mmap_sem (or mm_users 0)
when decrementing; whereas THP is careful to increment and decrement
it under page_table_lock.

Now most of those paths in THP also hold mmap_sem for read or write
(with appropriate checks on mm_users), but two do not: when
split_huge_page() is called by hwpoison_user_mappings(), and when
called by add_to_swap().

It's conceivable that the latter case is responsible for the
exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora.
===

The simplest way to fix it without having to alter the locking is to
make split_huge_page() a noop in nr_ptes terms, by counting the
preallocated pagetables that exist for every mapped hugepage. It was
an arbitrary choice not to count them, and neither way is wrong or
right, because they are not used but they're still allocated.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91d3efb..8f7fc39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,6 +671,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		set_pmd_at(mm, haddr, pmd, entry);
 		prepare_pmd_huge_pte(pgtable, mm);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		mm->nr_ptes++;
 		spin_unlock(&mm->page_table_lock);
 	}
 
@@ -789,6 +790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 	prepare_pmd_huge_pte(pgtable, dst_mm);
+	dst_mm->nr_ptes++;
 
 	ret = 0;
 out_unlock:
@@ -887,7 +889,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	}
 	kfree(pages);
 
-	mm->nr_ptes++;
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 	page_remove_rmap(page);
@@ -1047,6 +1048,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			VM_BUG_ON(page_mapcount(page) < 0);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
+			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
 			tlb_remove_page(tlb, page);
 			pte_free(tlb->mm, pgtable);
@@ -1375,7 +1377,6 @@ static int __split_huge_page_map(struct page *page,
 			pte_unmap(pte);
 		}
 
-		mm->nr_ptes++;
 		smp_wmb(); /* make pte visible before pmd */
 		/*
 		 * Up to this point the pmd is present and huge and
@@ -1988,7 +1989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache(vma, address, _pmd);
 	prepare_pmd_huge_pte(pgtable, mm);
-	mm->nr_ptes--;
 	spin_unlock(&mm->page_table_lock);
 
 #ifndef CONFIG_NUMA


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-03-02 22:53         ` Hugh Dickins
@ 2012-03-02 22:58           ` Andrew Morton
  2012-03-02 23:09             ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2012-03-02 22:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Dave Jones, David Rientjes, linux-kernel,
	linux-mm, kernel-team

On Fri, 2 Mar 2012 14:53:32 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Subject: Re: exit_mmap() BUG_ON triggering since 3.1
> ...
> Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes

So it's needed in 3.1.x and 3.2.x?


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-03-02 22:58           ` Andrew Morton
@ 2012-03-02 23:09             ` Hugh Dickins
  2012-03-05 19:59               ` Josh Boyer
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2012-03-02 23:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Dave Jones, David Rientjes, linux-kernel,
	linux-mm, kernel-team

On Fri, 2 Mar 2012, Andrew Morton wrote:
> On Fri, 2 Mar 2012 14:53:32 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > Subject: Re: exit_mmap() BUG_ON triggering since 3.1
> > ...
> > Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
> 
> So it's needed in 3.1.x and 3.2.x?

Indeed it would be needed in -stable, thanks, I forgot to add that.

And although Fedora only got reports from 3.1 onwards, I believe it
would equally be needed in 3.0.x.  3.1.x is closed down now, but
3.0.x and 3.2.x are still open.

I've not yet tried applying it to the latest of either of those: maybe
it applies cleanly and correctly, but I could imagine movements too.
But the first step, yes, is to Cc: stable@vger.kernel.org

Thanks,
Hugh


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-03-02 23:09             ` Hugh Dickins
@ 2012-03-05 19:59               ` Josh Boyer
  2012-03-05 20:07                 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Boyer @ 2012-03-05 19:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Andrea Arcangeli, Dave Jones, David Rientjes,
	linux-kernel, linux-mm, kernel-team

On Fri, Mar 02, 2012 at 03:09:29PM -0800, Hugh Dickins wrote:
> On Fri, 2 Mar 2012, Andrew Morton wrote:
> > On Fri, 2 Mar 2012 14:53:32 -0800 (PST)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > Subject: Re: exit_mmap() BUG_ON triggering since 3.1
> > > ...
> > > Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
> > 
> > So it's needed in 3.1.x and 3.2.x?
> 
> Indeed it would be needed in -stable, thanks, I forgot to add that.
> 
> And although Fedora only got reports from 3.1 onwards, I believe it
> would equally be needed in 3.0.x.  3.1.x is closed down now, but
> 3.0.x and 3.2.x are still open.
> 
> I've not yet tried applying it to the latest of either of those: maybe
> it applies cleanly and correctly, but I could imagine movements too.
> But the first step, yes, is to Cc: stable@vger.kernel.org

I don't see this in linux-next, 3.3-rcX, the stable-queue, or really
anywhere at all at the moment.  Did the patch get swallowed up by some
kind of evil code Eagle of Doom before making it into the safety of a
tree somewhere?

josh


* Re: exit_mmap() BUG_ON triggering since 3.1
  2012-03-05 19:59               ` Josh Boyer
@ 2012-03-05 20:07                 ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2012-03-05 20:07 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Jones, David Rientjes,
	linux-kernel, linux-mm, kernel-team

On Mon, 5 Mar 2012 14:59:53 -0500
Josh Boyer <jwboyer@redhat.com> wrote:

> On Fri, Mar 02, 2012 at 03:09:29PM -0800, Hugh Dickins wrote:
> > On Fri, 2 Mar 2012, Andrew Morton wrote:
> > > On Fri, 2 Mar 2012 14:53:32 -0800 (PST)
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > > Subject: Re: exit_mmap() BUG_ON triggering since 3.1
> > > > ...
> > > > Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
> > > 
> > > So it's needed in 3.1.x and 3.2.x?
> > 
> > Indeed it would be needed in -stable, thanks, I forgot to add that.
> > 
> > And although Fedora only got reports from 3.1 onwards, I believe it
> > would equally be needed in 3.0.x.  3.1.x is closed down now, but
> > 3.0.x and 3.2.x are still open.
> > 
> > I've not yet tried applying it to the latest of either of those: maybe
> > it applies cleanly and correctly, but I could imagine movements too.
> > But the first step, yes, is to Cc: stable@vger.kernel.org
> 
> I don't see this in linux-next, 3.3-rcX, the stable-queue, or really
> anywhere at all at the moment.  Did the patch get swallowed up by some
> kind of evil code Eagle of Doom before making it into the safety of a
> tree somewhere?

I have it queued for 3.3.  It's taking a while to get things into -next
at present because they're breaking things faster than I can fix them :(



Thread overview: 12+ messages
2012-02-15 18:33 exit_mmap() BUG_ON triggering since 3.1 Dave Jones
2012-02-16  2:14 ` Hugh Dickins
2012-02-16  2:22   ` Roland Dreier
2012-02-16  2:48     ` Hugh Dickins
2012-02-16  7:07   ` Andrea Arcangeli
2012-02-16  9:53     ` Hugh Dickins
2012-02-16 21:42       ` Andrea Arcangeli
2012-03-02 22:53         ` Hugh Dickins
2012-03-02 22:58           ` Andrew Morton
2012-03-02 23:09             ` Hugh Dickins
2012-03-05 19:59               ` Josh Boyer
2012-03-05 20:07                 ` Andrew Morton
