stable.vger.kernel.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Boyer <jwboyer@gmail.com>,
	Greg KH <gregkh@linuxfoundation.org>,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
	Ulrich Obergfell <uobergfe@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
	Larry Woodman <lwoodman@redhat.com>,
	Petr Matousek <pmatouse@redhat.com>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
Date: Thu, 7 Jun 2012 21:04:15 +0200
Message-ID: <20120607190414.GF21339@redhat.com>
In-Reply-To: <CA+55aFzv-z0njtAnpgOsVES70+igwV7HCteQbQ6M6uTsYvn5WQ@mail.gmail.com>

On Thu, Jun 07, 2012 at 10:46:44AM -0700, Linus Torvalds wrote:
> So I assume that Xen just turns the page tables read-only in order to
> track them, and then assumes that nobody modifies them in the
> particular section. And the cmpxchg64 looks like a modification, even
> if we only use it to read things.

Agreed, the implicit write could be the trigger.

> Andrea, do we have any guarantees like "once it has turned into a
> regular page table, we won't see it turn back if we hold the mmap
> sem"? Or anything like that? Because it is possible that we could do

Yes, once it turns into a regular page table it will stop changing.

The problem is the THP case. Without THP it can only change from none
to a regular page table. With THP it can change from none to
trans_huge to none to trans_huge, and it only stops changing if it
eventually becomes a regular page table.
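
As a rough illustration of the transitions just described (with the
mmap_sem held), here is a small userspace model. The enum and the
function name are made up for this sketch and are not kernel
identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three pmd states discussed above. */
enum demo_pmd_state { DEMO_PMD_NONE, DEMO_PMD_TRANS_HUGE, DEMO_PMD_TABLE };

/* Can the pmd go from "from" to "to" under a concurrent walker? */
bool demo_pmd_may_change(enum demo_pmd_state from, enum demo_pmd_state to,
			 bool thp)
{
	if (from == to)
		return true;		/* nothing changed */

	switch (from) {
	case DEMO_PMD_TABLE:
		return false;		/* a regular page table is stable */
	case DEMO_PMD_NONE:
		/* without THP the only way out of none is a page table */
		return to == DEMO_PMD_TABLE ||
		       (thp && to == DEMO_PMD_TRANS_HUGE);
	case DEMO_PMD_TRANS_HUGE:
		/* only exists with THP: back to none, or to a page table */
		return thp;
	}
	return false;
}
```

The only state that never changes again is DEMO_PMD_TABLE, which is
why the walker wants to know whether it has reached it.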

> this entirely with some ordering guarantee - something like the
> attached patch?

It's possible to do it with a loop like in the patch; gup_fast does it
that way (and gup_fast does it on the pte, so the pte is susceptible to
the exact same instability that the pmd has even when THP=n, as
madvise/pagefault can run under gup_fast). But I'm not sure it's
safe, especially with irqs enabled. Maybe gup_fast is safe because it
disables irqs to stop MADV_DONTNEED?

The race would be:

            l = pmd.low
            smp_rmb()
            h = pmd.high
            smp_rmb()
                                       pmd points to page 4G
                                       MADV_DONTNEED
                                       page fault allocates page at 8G
            l = pmd.low

Disabling irqs may be enough to hang MADV_DONTNEED on the tlb flush
IPI. But my feeling is that the page fault can happen even while
MADV_DONTNEED waits on the tlb flush IPI. So could the above still
happen with gup_fast too? The gup_fast troubles are irrelevant here,
but I was thinking about this one before, so I mention it as it's the
same problem.
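
To make the concern concrete, here is a single-threaded userspace
sketch of one interleaving that a low/high/low retry loop cannot
detect. It is only my reading of the race above. The struct, the
writer hook and the flag value are assumptions for illustration, and
the "other CPU" is simulated by a callback rather than by real
concurrency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32bit PAE pmd, read as two 32bit halves. */
struct pmd_halves {
	volatile uint32_t low;
	volatile uint32_t high;
};

/* Hypothetical hook standing in for "what the other CPU does here". */
typedef void (*writer_fn)(struct pmd_halves *pmd, int step);

#define DEMO_FLAGS 0x67u	/* illustrative low bits, same for both pages */

/* Read low, high, then re-check low; retry if low changed meanwhile. */
uint64_t read_low_high_low(struct pmd_halves *pmd, writer_fn writer)
{
	uint32_t l, h;
	int step = 0;

	assert(pmd && writer);
	do {
		l = pmd->low;
		writer(pmd, step++);	/* other CPU runs between the reads */
		h = pmd->high;
		writer(pmd, step++);	/* and again before the re-check */
	} while (l != pmd->low);

	return ((uint64_t)h << 32) | l;
}

/* Step 0: zap the pmd (MADV_DONTNEED). Step 1: fault in a page at 8G. */
void zap_then_refault(struct pmd_halves *pmd, int step)
{
	if (step == 0) {
		pmd->low = 0;
		pmd->high = 0;
	} else if (step == 1) {
		pmd->high = 2;		/* 8G >> 32 */
		pmd->low = DEMO_FLAGS;
	}
}

/* A writer that never touches the pmd, for comparison. */
void idle_writer(struct pmd_halves *pmd, int step)
{
	(void)pmd;
	(void)step;
}

/* Starts with the pmd pointing at 4G and runs the racy interleaving. */
uint64_t demo_torn_snapshot(void)
{
	struct pmd_halves pmd = { .low = DEMO_FLAGS, .high = 1 };

	return read_low_high_low(&pmd, zap_then_refault);
}
```

In this model demo_torn_snapshot() returns 0x67: the low half of the
4G entry combined with the high half read while the pmd was none. The
loop exits on its first pass because the low half reads the same
before and after, so the re-check on low alone cannot tell the two
entries apart.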

The "simple" idea in pmd_none_or_trans_huge_or_clear_bad is that we
need an atomic snapshot of the pmdval, stabilized in a register or on
the local stack, and we do the computations on that snapshot to know
if the pmd is stable or unstable.
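
A minimal userspace sketch of that "simple" idea, using C11 atomics in
place of the kernel's 64bit atomic read. The demo_ names and the flag
constants are a simplified model chosen for this example, not the
kernel definitions, and the pmd_clear_bad() side effect is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified x86-style flag bits; a userspace model, not kernel code. */
#define DEMO_PAGE_USER		0x004ull
#define DEMO_PAGE_PSE		0x080ull	/* marks a huge pmd here */
#define DEMO_KERNPG_TABLE	0x063ull	/* present|rw|accessed|dirty */
#define DEMO_FLAGS_MASK		0xfffull

/* Hypothetical shared pmd slot that other threads may rewrite. */
typedef _Atomic uint64_t demo_pmd_t;

int demo_pmd_none(uint64_t v)
{
	return v == 0;
}

int demo_pmd_trans_huge(uint64_t v)
{
	return (v & DEMO_PAGE_PSE) != 0;
}

int demo_pmd_bad(uint64_t v)
{
	return (v & DEMO_FLAGS_MASK & ~DEMO_PAGE_USER) != DEMO_KERNPG_TABLE;
}

/*
 * Returns 1 if the pmd is none, huge or bad (unstable), 0 if it points
 * to a regular page table. Every check runs on the local snapshot only.
 */
int demo_pmd_unstable(demo_pmd_t *pmdp)
{
	uint64_t pmdval = atomic_load(pmdp);	/* one atomic 64bit read */

	if (demo_pmd_none(pmdval))
		return 1;
	if (demo_pmd_bad(pmdval))
		return 1;	/* covers huge too: PSE is not a table bit */
	return 0;
}
```

Because pmdval is a local copy, a concurrent change to *pmdp after the
load cannot alter the outcome of the checks half way through.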

But the more "complex" idea would be to rely on the below barrier and
deal with "half corrupted" pmds.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	barrier();
#endif

The barrier prevents the *pmdp read from being cached across the return of
pmd_none_or_trans_huge_or_clear_bad when THP=y (our problem case). And
all we need is to compute those checks atomically on the "low" part.

	if (pmd_none(pmdval))
		return 1;
	if (unlikely(pmd_bad(pmdval))) {
		if (!pmd_trans_huge(pmdval))
			pmd_clear_bad(pmd);
		return 1;
	}
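
A sketch of why a torn high half would not matter to checks like the
above, assuming (as this model does) that every bit they test lives in
the low 32 bits. The constants and demo_ names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified x86-style flag bits; a userspace model, not kernel code. */
#define DEMO_PAGE_USER		0x004u
#define DEMO_KERNPG_TABLE	0x063u	/* present|rw|accessed|dirty */
#define DEMO_FLAGS_MASK		0xfffu

/* Builds a possibly torn value from two independently read halves. */
uint64_t demo_join(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* Returns 1 for none/huge/bad, 0 for a regular page table. */
int demo_unstable_low_only(uint64_t pmdval)
{
	uint32_t low = (uint32_t)pmdval;  /* high half deliberately ignored */

	if (low == 0)
		return 1;	/* none */
	if ((low & DEMO_FLAGS_MASK & ~DEMO_PAGE_USER) != DEMO_KERNPG_TABLE)
		return 1;	/* bad, which includes huge (PSE bit set) */
	return 0;		/* regular page table */
}
```

For any low half, demo_unstable_low_only() gives the same answer
whatever the high half holds, so in this model an inconsistent
low/high pair cannot flip the decision. It can only corrupt a later
use of the full value, which is what the re-read after the barrier()
is for.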

If we remove the #ifdef CONFIG_TRANSPARENT_HUGEPAGE around the
barrier(), we can get rid of pmd_read_atomic entirely and just do *pmd
as before the fix. (Note, however, that if we triggered the crash in
madvise with 32bit PAE THP=n, it means the value was cached by gcc and
the corrupted pmdval was used for running pte_offset.) So if we make
the barrier() unconditional we'll force a second access to
memory. Avoiding that is the whole point of keeping the barrier()
conditional: not to interfere with gcc's good work when not absolutely
necessary.
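
For readers less familiar with barrier(): it is a compiler-only
barrier. A userspace sketch of the check-then-reread pattern follows,
using the GCC/Clang empty-asm idiom. demo_barrier() and the two
functions are names invented for this example, and the test on the low
half is a simplification of the real checks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * GCC/Clang compiler-only barrier: emits no instruction, but the
 * compiler must assume memory changed and cannot reuse values it
 * loaded from memory before this point.
 */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Returns 1 if the snapshot says "do not walk the ptes", 0 otherwise. */
int demo_pmd_none_check(const uint64_t *pmdp)
{
	uint64_t pmdval = *pmdp;	/* snapshot used for the check only */

	demo_barrier();			/* do not carry this load past here */
	return (uint32_t)pmdval == 0;	/* decision taken on the low half */
}

/* Hypothetical caller: check first, then load the pmd again for use. */
uint64_t demo_check_then_reread(const uint64_t *pmdp)
{
	assert(pmdp);
	if (demo_pmd_none_check(pmdp))
		return 0;
	return *pmdp;			/* fresh load after the barrier */
}
```

The cost mentioned above is visible here: the second *pmdp is a real
memory access even when the compiler still had the first value in a
register.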

Anyway I made a patch below to take advantage of the barrier() and
deal with corrupted pmds on pae 32bit x86 THP=y which I hope could fix
this more optimally:

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 43876f1..149d968 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -31,7 +31,6 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 	ptep->pte_low = pte.pte_low;
 }
 
-#define pmd_read_atomic pmd_read_atomic
 /*
  * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
  * a "*pmdp" dereference done by gcc. Problem is, in certain places
@@ -53,10 +52,18 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
  *
  * With THP if the mmap_sem is hold for reading, the pmd can become
  * THP or null or point to a pte (and in turn become "stable") at any
- * time under pmd_read_atomic, so it's mandatory to read it atomically
- * with cmpxchg8b.
+ * time under pmd_read_atomic. We could read it atomically here with a
+ * pmd_read_atomic using atomic64_read for the THP case, but instead
+ * we let the generic version of pmd_read_atomic run, and we
+ * rely on the barrier() in pmd_none_or_trans_huge_or_clear_bad() to
+ * prevent gcc from caching the potentially corrupted pmdval in pte_offset
+ * later. The barrier() will force the re-reading of the pmd and the
+ * checks in pmd_none_or_trans_huge_or_clear_bad() will only care
+ * about the low part of the pmd, regardless if the high part is
+ * consistent.
  */
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_read_atomic pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
@@ -74,11 +81,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 
 	return (pmd_t) { ret };
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ae39c4b..29e648a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -484,6 +484,13 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.
+	 *
+	 * The barrier for the "x86 32bit PAE
+	 * CONFIG_TRANSPARENT_HUGEPAGE=y" case will also prevent
+	 * inconsistent pmd low/high values (obtained by the generic
+	 * version of pmd_read_atomic) from being cached by gcc. The below
+	 * checks will only care about the low part of the pmd with
+	 * 32bit PAE.
 	 */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	barrier();



