From: Greg KH <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
alan@lxorguk.ukuu.org.uk, Ulrich Obergfell <uobergfe@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
Larry Woodman <lwoodman@redhat.com>,
Petr Matousek <pmatouse@redhat.com>,
Rik van Riel <riel@redhat.com>
Subject: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
Date: Thu, 07 Jun 2012 13:03:44 +0900 [thread overview]
Message-ID: <20120607040337.622672845@linuxfoundation.org> (raw)
In-Reply-To: <20120607041406.GA13233@kroah.com>
3.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andrea Arcangeli <aarcange@redhat.com>
commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream.
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/x86/include/asm/pgtable-3level.h | 50 ++++++++++++++++++++++++++++++++++
include/asm-generic/pgtable.h | 22 +++++++++++++-
2 files changed, 70 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t
ptep->pte_low = pte.pte_low;
}
+#define pmd_read_atomic pmd_read_atomic
+/*
+ * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
+ * a "*pmdp" dereference done by gcc. Problem is, in certain places
+ * where pte_offset_map_lock is called, concurrent page faults are
+ * allowed, if the mmap_sem is hold for reading. An example is mincore
+ * vs page faults vs MADV_DONTNEED. On the page fault side
+ * pmd_populate rightfully does a set_64bit, but if we're reading the
+ * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
+ * because gcc will not read the 64bit of the pmd atomically. To fix
+ * this all places running pmd_offset_map_lock() while holding the
+ * mmap_sem in read mode, shall read the pmdp pointer using this
+ * function to know if the pmd is null nor not, and in turn to know if
+ * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
+ * operations.
+ *
+ * Without THP if the mmap_sem is hold for reading, the
+ * pmd can only transition from null to not null while pmd_read_atomic runs.
+ * So there's no need of literally reading it atomically.
+ *
+ * With THP if the mmap_sem is hold for reading, the pmd can become
+ * THP or null or point to a pte (and in turn become "stable") at any
+ * time under pmd_read_atomic, so it's mandatory to read it atomically
+ * with cmpxchg8b.
+ */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+ pmdval_t ret;
+ u32 *tmp = (u32 *)pmdp;
+
+ ret = (pmdval_t) (*tmp);
+ if (ret) {
+ /*
+ * If the low part is null, we must not read the high part
+ * or we can end up with a partial pmd.
+ */
+ smp_rmb();
+ ret |= ((pmdval_t)*(tmp + 1)) << 32;
+ }
+
+ return (pmd_t) { ret };
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+ return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
{
set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
#endif /* __HAVE_ARCH_PMD_WRITE */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#ifndef pmd_read_atomic
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
+{
+ /*
+ * Depend on compiler for an atomic pmd read. NOTE: this is
+ * only going to work, if the pmdval_t isn't larger than
+ * an unsigned long.
+ */
+ return *pmdp;
+}
+#endif
+
/*
* This function is meant to be used by sites walking pagetables with
* the mmap_sem hold in read mode to protect against MADV_DONTNEED and
@@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
* undefined so behaving like if the pmd was none is safe (because it
* can return none anyway). The compiler level barrier() is critically
* important to compute the two checks atomically on the same pmdval.
+ *
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
+ * care of reading the pmd atomically to avoid SMP race conditions
+ * against pmd_populate() when the mmap_sem is hold for reading by the
+ * caller (a special atomic read not done by "gcc" as in the generic
+ * version above, is also needed when THP is disabled because the page
+ * fault can populate the pmd from under us).
*/
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
- /* depend on compiler for an atomic pmd read */
- pmd_t pmdval = *pmd;
+ pmd_t pmdval = pmd_read_atomic(pmd);
/*
* The barrier will stabilize the pmdval in a register or on
* the stack so that it will stop changing under the code.
next prev parent reply other threads:[~2012-06-07 4:03 UTC|newest]
Thread overview: 106+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-07 4:14 [ 00/82] 3.4.2-stable review Greg KH
2012-06-07 4:03 ` [ 01/82] exofs: Fix CRASH on very early IO errors Greg KH
2012-06-07 4:03 ` [ 02/82] microblaze: Do not select GENERIC_GPIO by default Greg KH
2012-06-07 4:03 ` [ 03/82] SCSI: fix scsi_wait_scan Greg KH
2012-06-07 4:03 ` [ 04/82] SCSI: Fix dm-multipath starvation when scsi host is busy Greg KH
2012-06-07 4:03 ` [ 05/82] mm/fork: fix overflow in vma length when copying mmap on clone Greg KH
2012-06-07 4:03 ` [ 06/82] mm: fix NULL ptr deref when walking hugepages Greg KH
2012-06-07 4:03 ` [ 07/82] mm: consider all swapped back pages in used-once logic Greg KH
2012-06-07 4:03 ` Greg KH [this message]
2012-06-07 13:42 ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Josh Boyer
2012-06-07 14:42 ` Andrea Arcangeli
2012-06-07 17:46 ` Linus Torvalds
2012-06-07 19:04 ` Andrea Arcangeli
2012-06-07 21:00 ` Andrea Arcangeli
2012-06-07 21:00 ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE Andrea Arcangeli
2012-06-10 2:03 ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\ Konrad Rzeszutek Wilk
2012-06-11 10:34 ` [Xen-devel] " Andrew Jones
2012-06-11 19:27 ` Konrad Rzeszutek Wilk
2012-06-11 19:41 ` Andrea Arcangeli
2012-06-07 21:02 ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Andrea Arcangeli
2012-06-08 8:04 ` Greg KH
2012-06-07 17:52 ` Konrad Rzeszutek Wilk
2012-06-07 4:03 ` [ 09/82] mm: fix faulty initialization in vmalloc_init() Greg KH
2012-06-07 4:03 ` [ 10/82] iwlwifi: update BT traffic load states correctly Greg KH
2012-06-07 4:03 ` [ 11/82] iwlwifi: do not use shadow registers by default Greg KH
2012-06-07 4:03 ` [ 12/82] cifs: Include backup intent search flags during searches {try #2) Greg KH
2012-06-07 4:03 ` [ 13/82] cifs: fix oops while traversing open file list (try #4) Greg KH
2012-06-07 4:03 ` [ 14/82] PARISC: fix boot failure on 32-bit systems caused by branch stubs placed before .text Greg KH
2012-06-07 4:03 ` [ 15/82] PARISC: fix TLB fault path on PA2.0 narrow systems Greg KH
2012-06-07 4:03 ` [ 16/82] solos-pci: Fix DMA support Greg KH
2012-06-07 4:03 ` [ 17/82] MIPS: BCM63XX: Add missing include for bcm63xx_gpio.h Greg KH
2012-06-07 4:03 ` [ 18/82] mac80211: fix ADDBA declined after suspend with wowlan Greg KH
2012-06-07 4:03 ` [ 19/82] ixp4xx: fix compilation by adding gpiolib support Greg KH
2012-06-07 4:03 ` [ 20/82] ath9k: fix a use-after-free-bug when ath_tx_setup_buffer() fails Greg KH
2012-06-07 4:03 ` [ 21/82] x86, amd, xen: Avoid NULL pointer paravirt references Greg KH
2012-06-07 4:03 ` [ 22/82] NFS: kmalloc() doesnt return an ERR_PTR() Greg KH
2012-06-07 4:03 ` [ 23/82] NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO Greg KH
2012-06-07 4:04 ` [ 24/82] hugetlb: fix resv_map leak in error path Greg KH
2012-06-07 4:04 ` [ 25/82] sunrpc: fix loss of task->tk_status after rpc_delay call in xprt_alloc_slot Greg KH
2012-06-07 4:04 ` [ 26/82] iommu/amd: Check for the right TLP prefix bit Greg KH
2012-06-07 4:04 ` [ 27/82] iommu/amd: Add workaround for event log erratum Greg KH
2012-06-07 4:04 ` [ 28/82] drm/radeon: fix XFX quirk Greg KH
2012-06-07 4:04 ` [ 29/82] drm/radeon: fix typo in trinity tiling setup Greg KH
2012-06-07 4:04 ` [ 30/82] drm/i915: properly handle interlaced bit for sdvo dtd conversion Greg KH
2012-06-07 4:04 ` [ 31/82] drm/i915: Adding TV Out Missing modes Greg KH
2012-06-07 4:04 ` [ 32/82] drm/i915: wait for a vblank to pass after tv detect Greg KH
2012-06-07 4:04 ` [ 33/82] drm/i915: no lvds quirk for HP t5740e Thin Client Greg KH
2012-06-07 4:04 ` [ 34/82] kbuild: install kernel-page-flags.h Greg KH
2012-06-07 4:04 ` [ 35/82] mm: fix vma_resv_map() NULL pointer Greg KH
2012-06-07 4:04 ` [ 36/82] ALSA: usb-audio: fix rate_list memory leak Greg KH
2012-06-07 4:04 ` [ 37/82] slub: fix a memory leak in get_partial_node() Greg KH
2012-06-07 4:04 ` [ 38/82] vfs: umount_tree() might be called on subtree that had never made it Greg KH
2012-06-07 4:04 ` [ 39/82] vfs: increment iversion when a file is truncated Greg KH
2012-06-07 4:04 ` [ 40/82] fec_mpc52xx: fix timestamp filtering Greg KH
2012-06-07 4:04 ` [ 41/82] x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32 Greg KH
2012-06-07 4:04 ` [ 42/82] x86: Reset the debug_stack update counter Greg KH
2012-06-07 4:04 ` [ 43/82] mtd: nand: fix scan_read_raw_oob Greg KH
2012-06-07 4:04 ` [ 44/82] mtd: of_parts: fix breakage in Kconfig Greg KH
2012-06-07 4:04 ` [ 45/82] mtd: block2mtd: fix recursive call of mtd_writev Greg KH
2012-06-07 4:04 ` [ 46/82] mtd: mxc_nand: move ecc strengh setup before nand_scan_tail Greg KH
2012-06-07 4:04 ` [ 47/82] drm/radeon: fix regression in UMS CS ioctl Greg KH
2012-06-07 4:04 ` [ 48/82] drm/radeon: fix bank information in tiling config Greg KH
2012-06-07 4:04 ` [ 49/82] drm/radeon: properly program gart on rv740, juniper, cypress, barts, hemlock Greg KH
2012-06-07 4:04 ` [ 50/82] drm/radeon: fix HD6790, HD6570 backend programming Greg KH
2012-06-07 4:04 ` [ 51/82] drm/ttm: Fix spinlock imbalance Greg KH
2012-06-07 4:04 ` [ 52/82] drm/vmwgfx: Fix nasty write past alloced memory area Greg KH
2012-06-07 4:04 ` [ 53/82] asix: allow full size 8021Q frames to be received Greg KH
2012-06-08 2:27 ` Ben Hutchings
2012-06-08 3:54 ` David Miller
2012-06-07 4:04 ` [ 54/82] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow Greg KH
2012-06-07 4:04 ` [ 55/82] ipv6: fix incorrect ipsec fragment Greg KH
2012-06-07 4:04 ` [ 56/82] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case Greg KH
2012-06-07 4:04 ` [ 57/82] skb: avoid unnecessary reallocations in __skb_cow Greg KH
2012-06-07 4:04 ` [ 58/82] xfrm: take net hdr len into account for esp payload size calculation Greg KH
2012-06-07 4:04 ` [ 59/82] ext4: fix potential NULL dereference in ext4_free_inodes_counts() Greg KH
2012-06-07 4:04 ` [ 60/82] ext4: force ro mount if ext4_setup_super() fails Greg KH
2012-06-07 4:04 ` [ 61/82] ext4: fix potential integer overflow in alloc_flex_gd() Greg KH
2012-06-07 4:04 ` [ 62/82] ext4: disallow hard-linked directory in ext4_lookup Greg KH
2012-06-07 4:04 ` [ 63/82] ext4: add missing save_error_info() to ext4_error() Greg KH
2012-06-07 4:04 ` [ 64/82] ext4: dont trash state flags in EXT4_IOC_SETFLAGS Greg KH
2012-06-08 3:03 ` Ben Hutchings
2012-06-08 3:11 ` Ted Ts'o
2012-06-08 3:21 ` Ben Hutchings
2012-06-08 20:05 ` Ted Ts'o
2012-06-08 23:01 ` Ben Hutchings
2012-06-09 2:30 ` Ted Ts'o
2012-06-09 12:56 ` Ben Hutchings
2012-06-09 15:23 ` Greg KH
2012-06-07 4:04 ` [ 65/82] ext4: add ext4_mb_unload_buddy in the error path Greg KH
2012-06-07 4:04 ` [ 66/82] ext4: remove mb_groups before tearing down the buddy_cache Greg KH
2012-06-07 4:04 ` [ 67/82] radix-tree: fix contiguous iterator Greg KH
2012-06-07 4:04 ` [ 68/82] drm/radeon/audio: dont hardcode CRTC id Greg KH
2012-06-07 4:04 ` [ 69/82] drm/radeon: fix vm deadlocks on cayman Greg KH
2012-06-07 4:04 ` [ 70/82] drm/radeon/kms: add new Trinity PCI ids Greg KH
2012-06-07 4:04 ` [ 71/82] drm/radeon/kms: add new Palm, Sumo " Greg KH
2012-06-07 4:04 ` [ 72/82] drm/radeon/kms: add new BTC " Greg KH
2012-06-07 4:04 ` [ 73/82] drm/radeon/kms: add new SI " Greg KH
2012-06-07 4:04 ` [ 74/82] iommu/amd: Cache pdev pointer to root-bridge Greg KH
2012-06-07 4:04 ` [ 75/82] iommu/amd: Fix deadlock in ppr-handling error path Greg KH
2012-06-07 4:04 ` [ 76/82] ACPI battery: only refresh the sysfs files when pertinent information changes Greg KH
2012-06-07 4:04 ` [ 77/82] vfs: Fix /proc/<tid>/fdinfo/<fd> file handling Greg KH
2012-06-07 4:04 ` [ 78/82] md: raid1/raid10: fix problem with merge_bvec_fn Greg KH
2012-06-07 4:04 ` [ 79/82] wl1251: fix oops on early interrupt Greg KH
2012-06-07 4:04 ` [ 80/82] drm/i915: always use RPNSWREQ for turbo change requests Greg KH
2012-06-07 4:04 ` [ 81/82] drm/i915/dp: Flush any outstanding work to turn the VDD off Greg KH
2012-06-07 4:04 ` [ 82/82] drm/i915: enable vdd when switching off the eDP panel Greg KH
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120607040337.622672845@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alan@lxorguk.ukuu.org.uk \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lwoodman@redhat.com \
--cc=mgorman@suse.de \
--cc=pmatouse@redhat.com \
--cc=riel@redhat.com \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=uobergfe@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).