From: Larry Woodman <lwoodman@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
Rik van Riel <riel@redhat.com>,
Ulrich Obergfell <uobergfe@redhat.com>
Subject: Re: [PATCH] mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
Date: Thu, 15 Mar 2012 12:13:44 -0400 [thread overview]
Message-ID: <4F621538.7030707@redhat.com> (raw)
In-Reply-To: <1331822671-21508-1-git-send-email-aarcange@redhat.com>
On 03/15/2012 10:44 AM, Andrea Arcangeli wrote:
> In some cases it may happen that pmd_none_or_clear_bad() is called
> with the mmap_sem hold in read mode. In those cases the huge page
> faults can allocate hugepmds under pmd_none_or_clear_bad() and that
> can trigger a false positive from pmd_bad() that will not like to see
> a pmd materializing as trans huge.
>
> It's not khugepaged the problem, khugepaged holds the mmap_sem in
> write mode (and all those sites must hold the mmap_sem in read mode to
> prevent pagetables to go away from under them, during code review it
> seems vm86 mode on 32bit kernels requires that too unless it's
> restricted to 1 thread per process or UP builds). The race is only
> with the huge pagefaults that can convert a pmd_none() into a
> pmd_trans_huge().
>
> Effectively all these pmd_none_or_clear_bad() sites running with
> mmap_sem in read mode are somewhat speculative with the page faults,
> and the result is always undefined when they run simultaneously. This
> is probably why it wasn't common to run into this. For example if the
> madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
> fault, the hugepage will not be zapped, if the page fault runs first
> it will be zapped.
>
> Altering pmd_bad() not to error out if it finds hugepmds won't be
> enough to fix this, because zap_pmd_range would then proceed to call
> zap_pte_range (which would be incorrect if the pmd become a
> pmd_trans_huge()).
>
> The simplest way to fix this is to read the pmd in the local stack
> (regardless of what we read, no need of actual CPU barriers, only
> compiler barrier needed), and be sure it is not changing under the
> code that computes its value. Even if the real pmd is changing under
> the value we hold on the stack, we don't care. If we actually end up
> in zap_pte_range it means the pmd was not none already and it was not
> huge, and it can't become huge from under us (khugepaged locking
> explained above).
>
> All we need is to enforce that there is no way anymore that in a code
> path like below, pmd_trans_huge can be false, but
> pmd_none_or_clear_bad can run into a hugepmd. The overhead of a
> barrier() is just a compiler tweak and should not be measurable (I
> only added it for THP builds). I don't exclude different compiler
> versions may have prevented the race too by caching the value of *pmd
> on the stack (that hasn't been verified, but it wouldn't be impossible
> considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none
> are all inlines and there's no external function called in between
> pmd_trans_huge and pmd_none_or_clear_bad).
>
> if (pmd_trans_huge(*pmd)) {
> if (next-addr != HPAGE_PMD_SIZE) {
> VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
> split_huge_page_pmd(vma->vm_mm, pmd);
> } else if (zap_huge_pmd(tlb, vma, pmd, addr))
> continue;
> /* fall through */
> }
> if (pmd_none_or_clear_bad(pmd))
>
> Because this race condition could be exercised without special
> privileges this was reported in CVE-2012-1179.
>
> The race was identified and fully explained by Ulrich who debugged it.
> I'm quoting his accurate explanation below, for reference.
>
> ====== start quote =======
> mapcount 0 page_mapcount 1
> kernel BUG at mm/huge_memory.c:1384!
>
> At some point prior to the panic, a "bad pmd ..." message similar to the
> following is logged on the console:
>
> mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
>
> The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
> the page's PMD table entry.
>
> 143 void pmd_clear_bad(pmd_t *pmd)
> 144 {
> -> 145 pmd_ERROR(*pmd);
> 146 pmd_clear(pmd);
> 147 }
>
> After the PMD table entry has been cleared, there is an inconsistency
> between the actual number of PMD table entries that are mapping the page
> and the page's map count (_mapcount field in struct page). When the page
> is subsequently reclaimed, __split_huge_page() detects this inconsistency.
>
> 1381 if (mapcount != page_mapcount(page))
> 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
> 1383 mapcount, page_mapcount(page));
> -> 1384 BUG_ON(mapcount != page_mapcount(page));
>
> The root cause of the problem is a race of two threads in a multithreaded
> process. Thread B incurs a page fault on a virtual address that has never
> been accessed (PMD entry is zero) while Thread A is executing an madvise()
> system call on a virtual address within the same 2 MB (huge page) range.
>
> virtual address space
> .---------------------.
> | |
> | |
> .-|---------------------|
> | | |
> | | |<-- B(fault)
> | | |
> 2 MB | |/////////////////////|-.
> huge< |/////////////////////|> A(range)
> page | |/////////////////////|-'
> | | |
> | | |
> '-|---------------------|
> | |
> | |
> '---------------------'
>
> - Thread A is executing an madvise(..., MADV_DONTNEED) system call
> on the virtual address range "A(range)" shown in the picture.
>
> sys_madvise
> // Acquire the semaphore in shared mode.
> down_read(¤t->mm->mmap_sem)
> ...
> madvise_vma
> switch (behavior)
> case MADV_DONTNEED:
> madvise_dontneed
> zap_page_range
> unmap_vmas
> unmap_page_range
> zap_pud_range
> zap_pmd_range
> //
> // Assume that this huge page has never been accessed.
> // I.e. content of the PMD entry is zero (not mapped).
> //
> if (pmd_trans_huge(*pmd)) {
> // We don't get here due to the above assumption.
> }
> //
> // Assume that Thread B incurred a page fault and
> .---------> // sneaks in here as shown below.
> | //
> | if (pmd_none_or_clear_bad(pmd))
> | {
> | if (unlikely(pmd_bad(*pmd)))
> | pmd_clear_bad
> | {
> | pmd_ERROR
> | // Log "bad pmd ..." message here.
> | pmd_clear
> | // Clear the page's PMD entry.
> | // Thread B incremented the map count
> | // in page_add_new_anon_rmap(), but
> | // now the page is no longer mapped
> | // by a PMD entry (-> inconsistency).
> | }
> | }
> |
> v
> - Thread B is handling a page fault on virtual address "B(fault)" shown
> in the picture.
>
> ...
> do_page_fault
> __do_page_fault
> // Acquire the semaphore in shared mode.
> down_read_trylock(&mm->mmap_sem)
> ...
> handle_mm_fault
> if (pmd_none(*pmd)&& transparent_hugepage_enabled(vma))
> // We get here due to the above assumption (PMD entry is zero).
> do_huge_pmd_anonymous_page
> alloc_hugepage_vma
> // Allocate a new transparent huge page here.
> ...
> __do_huge_pmd_anonymous_page
> ...
> spin_lock(&mm->page_table_lock)
> ...
> page_add_new_anon_rmap
> // Here we increment the page's map count (starts at -1).
> atomic_set(&page->_mapcount, 0)
> set_pmd_at
> // Here we set the page's PMD entry which will be cleared
> // when Thread A calls pmd_clear_bad().
> ...
> spin_unlock(&mm->page_table_lock)
>
> The mmap_sem does not prevent the race because both threads are acquiring
> it in shared mode (down_read). Thread B holds the page_table_lock while
> the page's map count and PMD table entry are updated. However, Thread A
> does not synchronize on that lock.
> ====== end quote =======
>
> Reported-by: Ulrich Obergfell<uobergfe@redhat.com>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
> arch/x86/kernel/vm86_32.c | 2 +
> fs/proc/task_mmu.c | 9 ++++++
> include/asm-generic/pgtable.h | 57 +++++++++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 4 +++
> mm/memory.c | 14 ++++++++--
> mm/mempolicy.c | 2 +-
> mm/mincore.c | 2 +-
> mm/pagewalk.c | 2 +-
> mm/swapfile.c | 4 +--
> 9 files changed, 87 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
> index b466cab..328cb37 100644
> --- a/arch/x86/kernel/vm86_32.c
> +++ b/arch/x86/kernel/vm86_32.c
> @@ -172,6 +172,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
> spinlock_t *ptl;
> int i;
>
> + down_write(&mm->mmap_sem);
> pgd = pgd_offset(mm, 0xA0000);
> if (pgd_none_or_clear_bad(pgd))
> goto out;
> @@ -190,6 +191,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
> }
> pte_unmap_unlock(pte, ptl);
> out:
> + up_write(&mm->mmap_sem);
> flush_tlb();
> }
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 7dcd2a2..3efa725 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -409,6 +409,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> } else {
> spin_unlock(&walk->mm->page_table_lock);
> }
> +
> + if (pmd_trans_unstable(pmd))
> + return 0;
> /*
> * The mmap_sem held all the way back in m_start() is what
> * keeps khugepaged out of here and from collapsing things
> @@ -507,6 +510,8 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> struct page *page;
>
> split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr,&ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> @@ -670,6 +675,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> int err = 0;
>
> split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
>
> /* find the first VMA at or above 'addr' */
> vma = find_vma(walk->mm, addr);
> @@ -961,6 +968,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
> spin_unlock(&walk->mm->page_table_lock);
> }
>
> + if (pmd_trans_unstable(pmd))
> + return 0;
> orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr,&ptl);
> do {
> struct page *page = can_gather_numa_stats(*pte, md->vma, addr);
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 76bff2b..10f8291 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -443,6 +443,63 @@ static inline int pmd_write(pmd_t pmd)
> #endif /* __HAVE_ARCH_PMD_WRITE */
> #endif
>
> +/*
> + * This function is meant to be used by sites walking pagetables with
> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
> + * into a null pmd and the transhuge page fault can convert a null pmd
> + * into an hugepmd or into a regular pmd (if the hugepage allocation
> + * fails). While holding the mmap_sem in read mode the pmd becomes
> + * stable and stops changing under us only if it's not null and not a
> + * transhuge pmd. When those races occurs and this function makes a
> + * difference vs the standard pmd_none_or_clear_bad, the result is
> + * undefined so behaving like if the pmd was none is safe (because it
> + * can return none anyway). The compiler level barrier() is critically
> + * important to compute the two checks atomically on the same pmdval.
> + */
> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> +{
> + /* depend on compiler for an atomic pmd read */
> + pmd_t pmdval = *pmd;
> + /*
> + * The barrier will stabilize the pmdval in a register or on
> + * the stack so that it will stop changing under the code.
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + barrier();
> +#endif
> + if (pmd_none(pmdval))
> + return 1;
> + if (unlikely(pmd_bad(pmdval))) {
> + if (!pmd_trans_huge(pmdval))
> + pmd_clear_bad(pmd);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> + * This is a noop if Transparent Hugepage Support is not built into
> + * the kernel. Otherwise it is equivalent to
> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> + * places that already verified the pmd is not none and they want to
> + * walk ptes while holding the mmap sem in read mode (write mode don't
> + * need this). If THP is not enabled, the pmd can't go away under the
> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
> + * run a pmd_trans_unstable before walking the ptes after
> + * split_huge_page_pmd returns (because it may have run when the pmd
> + * become null, but then a page fault can map in a THP and not a
> + * regular page).
> + */
> +static inline int pmd_trans_unstable(pmd_t *pmd)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + return pmd_none_or_trans_huge_or_clear_bad(pmd);
> +#else
> + return 0;
> +#endif
> +}
> +
> #endif /* !__ASSEMBLY__ */
>
> #endif /* _ASM_GENERIC_PGTABLE_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d0e57a3..67b0578 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5193,6 +5193,8 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> spinlock_t *ptl;
>
> split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr,&ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE)
> @@ -5355,6 +5357,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> spinlock_t *ptl;
>
> split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
> retry:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr,&ptl);
> for (; addr != end; addr += PAGE_SIZE) {
> diff --git a/mm/memory.c b/mm/memory.c
> index fa2f04e..e3090fc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1251,12 +1251,20 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
> split_huge_page_pmd(vma->vm_mm, pmd);
> } else if (zap_huge_pmd(tlb, vma, pmd, addr))
> - continue;
> + goto next;
> /* fall through */
> }
> - if (pmd_none_or_clear_bad(pmd))
> - continue;
> + /*
> + * Here there can be other concurrent MADV_DONTNEED or
> + * trans huge page faults running, and if the pmd is
> + * none or trans huge it can change under us. This is
> + * because MADV_DONTNEED holds the mmap_sem in read
> + * mode.
> + */
> + if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> + goto next;
> next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> + next:
> cond_resched();
> } while (pmd++, addr = next, addr != end);
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 47296fe..0a37570 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -512,7 +512,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> do {
> next = pmd_addr_end(addr, end);
> split_huge_page_pmd(vma->vm_mm, pmd);
> - if (pmd_none_or_clear_bad(pmd))
> + if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> continue;
> if (check_pte_range(vma, pmd, addr, next, nodes,
> flags, private))
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 636a868..936b4ce 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -164,7 +164,7 @@ static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> }
> /* fall through */
> }
> - if (pmd_none_or_clear_bad(pmd))
> + if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> mincore_unmapped_range(vma, addr, next, vec);
> else
> mincore_pte_range(vma, pmd, addr, next, vec);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 2f5cf10..aa9701e 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -59,7 +59,7 @@ again:
> continue;
>
> split_huge_page_pmd(walk->mm, pmd);
> - if (pmd_none_or_clear_bad(pmd))
> + if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> goto again;
> err = walk_pte_range(pmd, addr, next, walk);
> if (err)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d999f09..f31b29d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -932,9 +932,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (unlikely(pmd_trans_huge(*pmd)))
> - continue;
> - if (pmd_none_or_clear_bad(pmd))
> + if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> continue;
> ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
> if (ret)
Acked-by: Larry Woodman <lwoodman@redhat.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-03-15 16:13 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-03-15 14:44 [PATCH] mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode Andrea Arcangeli
2012-03-15 16:01 ` Rik van Riel
2012-03-15 16:13 ` Larry Woodman [this message]
2012-03-15 17:16 ` Dave Jones
2012-03-15 17:41 ` Andrea Arcangeli
2012-03-15 22:30 ` Andrew Morton
2012-03-15 18:08 ` Johannes Weiner
2012-03-15 22:45 ` Andrew Morton
2012-03-15 23:15 ` Andrea Arcangeli
2012-03-15 23:27 ` Andrew Morton
2012-03-16 8:05 ` Naoya Horiguchi
2012-03-16 11:54 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4F621538.7030707@redhat.com \
--to=lwoodman@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=riel@redhat.com \
--cc=uobergfe@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).