linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Larry Woodman <lwoodman@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Michal Hocko <mhocko@suse.cz>, Linux-MM <linux-mm@kvack.org>,
	David Gibson <david@gibson.dropbear.id.au>,
	Ken Chen <kenchen@google.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
Date: Tue, 31 Jul 2012 13:49:21 -0400	[thread overview]
Message-ID: <50181AA1.0@redhat.com> (raw)
In-Reply-To: <20120731124650.GO612@suse.de>

On 07/31/2012 08:46 AM, Mel Gorman wrote:
>
> Fundamentally I think the problem is that we are not correctly detecting
> that page table sharing took place during huge_pte_alloc(). This patch is
> longer and makes an API change but if I'm right, it addresses the underlying
> problem. The first VM_MAYSHARE patch is still necessary but would you mind
> testing this on top please?
Hi Mel, yes this does work just fine.  It ran for hours without a panic so
I'll Ack this one if you send it to the list.

Larry



> ---8<---
> mm: hugetlbfs: Correctly detect if page tables have just been shared
>
> Each page mapped in a processes address space must be correctly
> accounted for in _mapcount. Normally the rules for this are
> straight-forward but hugetlbfs page table sharing is different.
> The page table pages at the PMD level are reference counted while
> the mapcount remains the same. If this accounting is wrong, it causes
> bugs like this one reported by Larry Woodman
>
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]
> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> [ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
> [ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
> [ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
> [ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
> [ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
> [ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
> [ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
> [ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
> [ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
> [ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
> [ 1106.318447] Stack:
> [ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
> [ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
> [ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
> [ 1106.345513] Call Trace:
> [ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
> [ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
> [ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
> [ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
> [ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
> [ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
> [ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
> [ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
> [ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
>
> During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
> shared page tables with the check dst_pte == src_pte. The logic is if
> the PMD page is the same, they must be shared. This assumes that the
> sharing is between the parent and child. However, if the sharing is with
> a different process entirely then this check fails as in this diagram.
>
> parent
>    |
>    ------------>pmd
>                 src_pte---------->  data page
>                                        ^
> other--------->pmd--------------------|
>                  ^
> child-----------|
>                 dst_pte
>
> For this situation to occur, it must be possible for Parent and Other
> to have faulted and failed to share page tables with each other. This is
> possible due to the following style of race.
>
> PROC A						PROC B
> copy_hugetlb_page_range				copy_hugetlb_page_range
>    src_pte == huge_pte_offset			  src_pte == huge_pte_offset
>    !src_pte so no sharing			  !src_pte so no sharing
>
> (time passes)
>
> hugetlb_fault					hugetlb_fault
>    huge_pte_alloc				  huge_pte_alloc
>      huge_pmd_share				   huge_pmd_share
>        LOCK(i_mmap_mutex)
>        find nothing, no sharing
>        UNLOCK(i_mmap_mutex)
>        						    LOCK(i_mmap_mutex)
>        						    find nothing, no sharing
>        						    UNLOCK(i_mmap_mutex)
>      pmd_alloc					    pmd_alloc
>      LOCK(instantiation_mutex)
>      fault
>      UNLOCK(instantiation_mutex)
>      						LOCK(instantiation_mutex)
>      						fault
>      						UNLOCK(instantiation_mutex)
>
> These two processes are not poing to the same data page but are not sharing
> page tables because the opportunity was missed. When either process later
> forks, the src_pte == dst pte is potentially insufficient.  As the check
> falls through, the wrong PTE information is copied in (harmless but wrong)
> and the mapcount is bumped for a page mapped by a shared page table leading
> to the BUG_ON.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> ---
>   arch/ia64/mm/hugetlbpage.c    |    3 ++-
>   arch/mips/mm/hugetlbpage.c    |    2 +-
>   arch/powerpc/mm/hugetlbpage.c |    3 ++-
>   arch/s390/mm/hugetlbpage.c    |    3 ++-
>   arch/sh/mm/hugetlbpage.c      |    3 ++-
>   arch/sparc/mm/hugetlbpage.c   |    3 ++-
>   arch/tile/mm/hugetlbpage.c    |    3 ++-
>   arch/x86/mm/hugetlbpage.c     |   13 ++++++++-----
>   include/linux/hugetlb.h       |    3 ++-
>   mm/hugetlb.c                  |   12 +++++++++---
>   10 files changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
> index 5ca674b..a0bb307 100644
> --- a/arch/ia64/mm/hugetlbpage.c
> +++ b/arch/ia64/mm/hugetlbpage.c
> @@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
>   EXPORT_SYMBOL(hpage_shift);
>
>   pte_t *
> -huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +	       bool *shared)
>   {
>   	unsigned long taddr = htlbpage_to_page(addr);
>   	pgd_t *pgd;
> diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
> index a7fee0d..06ca4a3 100644
> --- a/arch/mips/mm/hugetlbpage.c
> +++ b/arch/mips/mm/hugetlbpage.c
> @@ -23,7 +23,7 @@
>   #include<asm/tlbflush.h>
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
> -		      unsigned long sz)
> +		      unsigned long sz, bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 1a6de0a..5fc6672 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>   #define HUGEPD_PUD_SHIFT PMD_SHIFT
>   #endif
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +		      bool *shared)
>   {
>   	pgd_t *pg;
>   	pud_t *pu;
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index 532525e..2c3b501 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -65,7 +65,8 @@ void arch_release_hugepage(struct page *page)
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgdp;
>   	pud_t *pudp;
> diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
> index d776234..bbe154d 100644
> --- a/arch/sh/mm/hugetlbpage.c
> +++ b/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,8 @@
>   #include<asm/cacheflush.h>
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
> index 07e1453..d3ef01b 100644
> --- a/arch/sparc/mm/hugetlbpage.c
> +++ b/arch/sparc/mm/hugetlbpage.c
> @@ -194,7 +194,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
> index 812e2d0..db01091 100644
> --- a/arch/tile/mm/hugetlbpage.c
> +++ b/arch/tile/mm/hugetlbpage.c
> @@ -84,7 +84,8 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
>   #endif
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -		      unsigned long addr, unsigned long sz)
> +		      unsigned long addr, unsigned long sz,
> +		      bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..8c53064 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>   /*
>    * search for a shareable pmd page for hugetlb.
>    */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud,
> +			   bool *shared)
>   {
>   	struct vm_area_struct *vma = find_vma(mm, addr);
>   	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -91,9 +92,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   		goto out;
>
>   	spin_lock(&mm->page_table_lock);
> -	if (pud_none(*pud))
> +	if (pud_none(*pud)) {
>   		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte&  PAGE_MASK));
> -	else
> +		*shared = true;
> +	} else
>   		put_page(virt_to_page(spte));
>   	spin_unlock(&mm->page_table_lock);
>   out:
> @@ -128,7 +130,8 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> @@ -142,7 +145,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>   		} else {
>   			BUG_ON(sz != PMD_SIZE);
>   			if (pud_none(*pud))
> -				huge_pmd_share(mm, addr, pud);
> +				huge_pmd_share(mm, addr, pud, shared);
>   			pte = (pte_t *) pmd_alloc(mm, pud, addr);
>   		}
>   	}
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 73c7782..68d2597 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -75,7 +75,8 @@ extern struct list_head huge_boot_pages;
>   /* arch callbacks */
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz);
> +			unsigned long addr, unsigned long sz,
> +			bool *shared);
>   pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>   int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>   struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 71c93d7..45c2196 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2282,6 +2282,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   	int cow;
>   	struct hstate *h = hstate_vma(vma);
>   	unsigned long sz = huge_page_size(h);
> +	bool shared = false;
>
>   	cow = (vma->vm_flags&  (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>
> @@ -2289,12 +2290,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   		src_pte = huge_pte_offset(src, addr);
>   		if (!src_pte)
>   			continue;
> -		dst_pte = huge_pte_alloc(dst, addr, sz);
> +		dst_pte = huge_pte_alloc(dst, addr, sz,&shared);
>   		if (!dst_pte)
>   			goto nomem;
>
>   		/* If the pagetables are shared don't copy or take references */
> -		if (dst_pte == src_pte)
> +		if (shared)
>   			continue;
>
>   		spin_lock(&dst->page_table_lock);
> @@ -2817,6 +2818,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	struct page *pagecache_page = NULL;
>   	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   	struct hstate *h = hstate_vma(vma);
> +	bool shared = false;
>
>   	address&= huge_page_mask(h);
>
> @@ -2831,10 +2833,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   				VM_FAULT_SET_HINDEX(hstate_index(h));
>   	}
>
> -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> +	ptep = huge_pte_alloc(mm, address, huge_page_size(h),&shared);
>   	if (!ptep)
>   		return VM_FAULT_OOM;
>
> +	/* If the pagetable is shared, no other work is necessary */
> +	if (shared)
> +		return 0;
> +
>   	/*
>   	 * Serialize hugepage allocation and instantiation, so that we don't
>   	 * get spurious allocation failures if two CPUs race to instantiate
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2012-07-31 18:24 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
2012-07-20 14:29   ` Michal Hocko
2012-07-20 14:37     ` Mel Gorman
2012-07-20 14:40       ` Michal Hocko
2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
2012-07-20 14:51     ` Mel Gorman
2012-07-23  4:04       ` Hugh Dickins
2012-07-23 11:40         ` Mel Gorman
2012-07-24  1:08           ` Hugh Dickins
2012-07-24  8:32             ` Michal Hocko
2012-07-24  9:34             ` Mel Gorman
2012-07-24 10:04               ` Michal Hocko
2012-07-24 19:23               ` Hugh Dickins
2012-07-25  8:36                 ` Mel Gorman
2012-07-26 17:42         ` Rik van Riel
2012-07-26 18:04           ` Larry Woodman
2012-07-27  8:42           ` Mel Gorman
2012-07-26 18:37         ` Rik van Riel
2012-07-26 21:03           ` Larry Woodman
2012-07-27  3:48           ` Larry Woodman
2012-07-27 10:10             ` Larry Woodman
2012-07-27 10:23             ` Mel Gorman
2012-07-27 10:36               ` Larry Woodman
2012-07-30 19:11               ` Larry Woodman
2012-07-31 12:16                 ` Hillf Danton
2012-07-31 12:46                 ` Mel Gorman
2012-07-31 13:07                   ` Larry Woodman
2012-07-31 13:29                     ` Mel Gorman
2012-07-31 13:21                   ` Michal Hocko
2012-07-31 17:49                   ` Larry Woodman [this message]
2012-07-31 20:06                     ` Michal Hocko
2012-07-31 20:57                       ` Larry Woodman
2012-08-01  2:45                       ` Larry Woodman
2012-08-01  8:20                         ` Michal Hocko
2012-08-01 12:32                           ` Michal Hocko
2012-08-01 15:06                             ` Larry Woodman
2012-08-02  7:19                               ` Michal Hocko
2012-08-02  7:37                                 ` Mel Gorman
2012-08-02 12:36                                   ` Michal Hocko
2012-08-02 13:33                                     ` Mel Gorman
2012-08-02 13:53                                       ` Michal Hocko
2012-07-31 18:03                   ` Rik van Riel
2012-07-26 18:31     ` Rik van Riel
2012-07-27  9:02       ` Michal Hocko
2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
2012-07-27  8:47   ` Mel Gorman
2012-07-26 21:00 ` Rik van Riel
2012-07-26 21:54   ` Hugh Dickins
2012-07-27  8:52   ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50181AA1.0@redhat.com \
    --to=lwoodman@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=hughd@google.com \
    --cc=kenchen@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=riel@redhat.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).