From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Greg KH <gregkh@linuxfoundation.org>
Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org,
torvalds@linux-foundation.org, akpm@linux-foundation.org,
alan@lxorguk.ukuu.org.uk, Ulrich Obergfell <uobergfe@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
Larry Woodman <lwoodman@redhat.com>,
Petr Matousek <pmatouse@redhat.com>,
Rik van Riel <riel@redhat.com>
Subject: Re: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
Date: Thu, 7 Jun 2012 13:52:49 -0400
Message-ID: <20120607175249.GV9472@phenom.dumpdata.com>
In-Reply-To: <20120607040337.622672845@linuxfoundation.org>
On Thu, Jun 07, 2012 at 01:03:44PM +0900, Greg KH wrote:
> 3.4-stable review patch. If anyone has any objections, please let me know.
It breaks 32-bit Linux running under Amazon EC2. Please
don't apply it to any 3.x kernels until we figure out a
fix for this.
>
> ------------------
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream.
>
> When holding the mmap_sem for reading, pmd_offset_map_lock should only
> run on a pmd_t that has been read atomically from the pmdp pointer,
> otherwise we may read only half of it leading to this crash.
>
> PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
> #0 [f06a9dd8] crash_kexec at c049b5ec
> #1 [f06a9e2c] oops_end at c083d1c2
> #2 [f06a9e40] no_context at c0433ded
> #3 [f06a9e64] bad_area_nosemaphore at c043401a
> #4 [f06a9e6c] __do_page_fault at c0434493
> #5 [f06a9eec] do_page_fault at c083eb45
> #6 [f06a9f04] error_code (via page_fault) at c083c5d5
> EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000
> DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
> CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
> #7 [f06a9f38] _spin_lock at c083bc14
> #8 [f06a9f44] sys_mincore at c0507b7d
> #9 [f06a9fb0] system_call at c083becd
> start len
> EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
> DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
> SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
> CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
>
> This should be a longstanding bug affecting x86 32bit PAE without THP.
> Only archs with 64bit large pmd_t and 32bit unsigned long should be
> affected.
>
> With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
> would partly hide the bug when the pmd transitions from none to stable,
> by forcing a re-read of the *pmd in pmd_offset_map_lock. But when THP is
> enabled a new set of problems arises, because the pmd could then
> transition freely between any of the none, pmd_trans_huge or
> pmd_trans_stable states. So making the barrier in
> pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good idea
> and it would be a flaky solution.
>
> This should be fully fixed by introducing a pmd_read_atomic that reads
> the pmd in order with THP disabled, or by reading the pmd atomically
> with cmpxchg8b with THP enabled.
>
> Luckily this new race condition only triggers in the places that must
> already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
> is localized there but this bug is not related to THP.
>
> NOTE: this can trigger on x86 32bit systems with PAE enabled with more
> than 4G of ram, otherwise the high part of the pmd will never risk being
> truncated because it would be zero at all times, which in turn hides the
> SMP race.
>
> This bug was discovered and fully debugged by Ulrich, quote:
>
> ----
> [..]
> pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
> eax.
>
> 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> 497 {
> 498 /* depend on compiler for an atomic pmd read */
> 499 pmd_t pmdval = *pmd;
>
> // edi = pmd pointer
> 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
> ...
> // edx = PTE page table high address
> 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
> ...
> // eax = PTE page table low address
> 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
>
> [..]
>
> Please note that the PMD is not read atomically. These are two "mov"
> instructions where the high order bits of the PMD entry are fetched
> first. Hence, the above machine code is prone to the following race.
>
> - The PMD entry {high|low} is 0x0000000000000000.
> The "mov" at 0xc0507a84 loads 0x00000000 into edx.
>
> - A page fault (on another CPU) sneaks in between the two "mov"
> instructions and instantiates the PMD.
>
> - The PMD entry {high|low} is now 0x00000003fda38067.
> The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
> ----
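
The interleaving quoted above can be replayed in a single thread to see
which value the reader ends up with. This is only a user-space sketch, not
kernel code: struct pmd_halves, combine() and torn_read_replay() are
hypothetical names made up for illustration, and the "other CPU" is
simulated by a plain store between the two loads.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit PAE pmd entry seen as two 32-bit halves. */
struct pmd_halves {
	uint32_t low;
	uint32_t high;
};

/* Glue the two halves together the way the reader's two "mov"s do. */
static uint64_t combine(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/*
 * Replay the reported interleaving deterministically:
 *   1. reader loads the high half while the entry is still zero
 *   2. the page fault on the "other CPU" instantiates the entry
 *   3. reader loads the low half of the now-populated entry
 */
static uint64_t torn_read_replay(void)
{
	struct pmd_halves pmd = { 0, 0 };
	uint32_t high, low;

	high = pmd.high;		/* step 1: sees 0x00000000 */

	pmd.low = 0xfda38067u;		/* step 2: simulated populate, */
	pmd.high = 0x00000003u;		/* values taken from the report */

	low = pmd.low;			/* step 3: sees 0xfda38067 */

	return combine(high, low);
}
```

With the values from the report, torn_read_replay() yields
0x00000000fda38067, while the entry actually in memory is
0x00000003fda38067: the reader sees a non-zero pmd whose high bits are
missing.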
>
> Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Larry Woodman <lwoodman@redhat.com>
> Cc: Petr Matousek <pmatouse@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> ---
> arch/x86/include/asm/pgtable-3level.h | 50 ++++++++++++++++++++++++++++++++++
> include/asm-generic/pgtable.h | 22 +++++++++++++-
> 2 files changed, 70 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t
> ptep->pte_low = pte.pte_low;
> }
>
> +#define pmd_read_atomic pmd_read_atomic
> +/*
> + * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
> + * a "*pmdp" dereference done by gcc. Problem is, in certain places
> + * where pte_offset_map_lock is called, concurrent page faults are
> + * allowed, if the mmap_sem is hold for reading. An example is mincore
> + * vs page faults vs MADV_DONTNEED. On the page fault side
> + * pmd_populate rightfully does a set_64bit, but if we're reading the
> + * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
> + * because gcc will not read the 64bit of the pmd atomically. To fix
> + * this all places running pmd_offset_map_lock() while holding the
> + * mmap_sem in read mode, shall read the pmdp pointer using this
> + * function to know if the pmd is null nor not, and in turn to know if
> + * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
> + * operations.
> + *
> + * Without THP if the mmap_sem is hold for reading, the
> + * pmd can only transition from null to not null while pmd_read_atomic runs.
> + * So there's no need of literally reading it atomically.
> + *
> + * With THP if the mmap_sem is hold for reading, the pmd can become
> + * THP or null or point to a pte (and in turn become "stable") at any
> + * time under pmd_read_atomic, so it's mandatory to read it atomically
> + * with cmpxchg8b.
> + */
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	pmdval_t ret;
> +	u32 *tmp = (u32 *)pmdp;
> +
> +	ret = (pmdval_t) (*tmp);
> +	if (ret) {
> +		/*
> +		 * If the low part is null, we must not read the high part
> +		 * or we can end up with a partial pmd.
> +		 */
> +		smp_rmb();
> +		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> +	}
> +
> +	return (pmd_t) { ret };
> +}
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
> {
> set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
> #endif /* __HAVE_ARCH_PMD_WRITE */
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +#ifndef pmd_read_atomic
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	/*
> +	 * Depend on compiler for an atomic pmd read. NOTE: this is
> +	 * only going to work, if the pmdval_t isn't larger than
> +	 * an unsigned long.
> +	 */
> +	return *pmdp;
> +}
> +#endif
> +
> /*
> * This function is meant to be used by sites walking pagetables with
> * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> @@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
> * undefined so behaving like if the pmd was none is safe (because it
> * can return none anyway). The compiler level barrier() is critically
> * important to compute the two checks atomically on the same pmdval.
> + *
> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
> + * care of reading the pmd atomically to avoid SMP race conditions
> + * against pmd_populate() when the mmap_sem is hold for reading by the
> + * caller (a special atomic read not done by "gcc" as in the generic
> + * version above, is also needed when THP is disabled because the page
> + * fault can populate the pmd from under us).
> */
> static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> {
> - /* depend on compiler for an atomic pmd read */
> - pmd_t pmdval = *pmd;
> + pmd_t pmdval = pmd_read_atomic(pmd);
> /*
> * The barrier will stabilize the pmdval in a register or on
> * the stack so that it will stop changing under the code.
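
For reference, the read ordering used by the !THP variant in the patch can
be sketched in user space as below. This is an illustration only:
pmd_read_ordered_sketch() is a hypothetical name, the entry is modelled as
an array of two u32 halves (low half at index 0), and C11
atomic_thread_fence() is used as an assumed stand-in for smp_rmb(). It
relies on the same premises the patch comment states: the writer stores the
whole entry with a single 64-bit store, and the entry only goes from null
to non-null while the reader runs.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Read the low half first and only look at the high half when the low
 * half is non-zero. halves[0] is the low 32 bits, halves[1] the high 32
 * bits (little-endian layout, as on x86).
 */
static uint64_t pmd_read_ordered_sketch(const volatile uint32_t *halves)
{
	uint64_t ret = halves[0];

	if (ret) {
		/*
		 * Assumed stand-in for smp_rmb(): keep the high-half load
		 * from being reordered before the low-half load.
		 */
		atomic_thread_fence(memory_order_acquire);
		ret |= (uint64_t)halves[1] << 32;
	}

	return ret;
}
```

Under those premises the function returns either 0 (entry still treated as
none) or the complete 64-bit value, never a mix of the two.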
>
>