stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Konrad Rzeszutek Wilk <konrad@darnok.org>
To: Andrea Arcangeli <aarcange@redhat.com>, drjones@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Greg KH <gregkh@linuxfoundation.org>,
	676360@bugs.debian.org, xen-devel@lists.xensource.com,
	Jonathan Nieder <jrnieder@gmail.com>,
	linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org Konrad Rzeszutek Wilk"
	<konrad.wilk@oracle.com>,
	stable@vger.kernel.org, alan@lxorguk.ukuu.org.uk,
	Ulrich Obergfell <uobergfe@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
	Larry Woodman <lwoodman@redhat.com>,
	Petr Matousek <pmatouse@redhat.com>,
	Rik van Riel <riel@redhat.com>, Jan Beulich <jbeulich@suse.com>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Subject: Re: [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\
Date: Sat, 9 Jun 2012 22:03:31 -0400	[thread overview]
Message-ID: <20120610020331.GA26100@localhost.localdomain> (raw)
In-Reply-To: <1339102833-12358-2-git-send-email-aarcange@redhat.com>

On Thu, Jun 07, 2012 at 11:00:33PM +0200, Andrea Arcangeli wrote:
> In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding
> the mmap_sem for reading, cmpxchg8b cannot be used to read pmd
> contents under Xen.
> 
> So instead of dealing only with "consistent" pmdvals in
> pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
> simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals
> where the low 32bit and high 32bit could be inconsistent (to avoid
> having to use cmpxchg8b).

<nods>
> 
> The only guarantee we get from pmd_read_atomic is that if the low part
> of the pmd was found null, the high part will be null too (so the pmd
> will be considered unstable). And if the low part of the pmd is found
> "stable" later, then it means the whole pmd was read atomically
> (because after a pmd is stable, neither MADV_DONTNEED nor page faults
> can alter it anymore, and we read the high part after the low part).
> 
> In the 32bit PAE x86 case, it is enough to read the low part of the
> pmdval atomically to declare the pmd as "stable" and that's true for
> THP and no THP, furthermore in the THP case we also have a barrier()
> that will prevent any inconsistent pmdvals to be cached by a later
> re-read of the *pmd.

Nice. Andrew, any chane you could test this patch on the affected
Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?)
hypervisor or is it really only on Linode and Amazon EC2?

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable-3level.h |   30 +++++++++++++++++-------------
>  include/asm-generic/pgtable.h         |   10 ++++++++++
>  2 files changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index 43876f1..cb00ccc 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -47,16 +47,26 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
>   * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
>   * operations.
>   *
> - * Without THP if the mmap_sem is hold for reading, the
> - * pmd can only transition from null to not null while pmd_read_atomic runs.
> - * So there's no need of literally reading it atomically.
> + * Without THP if the mmap_sem is hold for reading, the pmd can only
> + * transition from null to not null while pmd_read_atomic runs. So
> + * we can always return atomic pmd values with this function.
>   *
>   * With THP if the mmap_sem is hold for reading, the pmd can become
> - * THP or null or point to a pte (and in turn become "stable") at any
> - * time under pmd_read_atomic, so it's mandatory to read it atomically
> - * with cmpxchg8b.
> + * trans_huge or none or point to a pte (and in turn become "stable")
> + * at any time under pmd_read_atomic. We could read it really
> + * atomically here with a atomic64_read for the THP enabled case (and
> + * it would be a whole lot simpler), but to avoid using cmpxchg8b we
> + * only return an atomic pmdval if the low part of the pmdval is later
> + * found stable (i.e. pointing to a pte). And we're returning a none
> + * pmdval if the low part of the pmd is none. In some cases the high
> + * and low part of the pmdval returned may not be consistent if THP is
> + * enabled (the low part may point to previously mapped hugepage,
> + * while the high part may point to a more recently mapped hugepage),
> + * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
> + * of the pmd to be read atomically to decide if the pmd is unstable
> + * or not, with the only exception of when the low part of the pmd is
> + * zero in which case we return a none pmd.
>   */
> -#ifndef CONFIG_TRANSPARENT_HUGEPAGE
>  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  {
>  	pmdval_t ret;
> @@ -74,12 +84,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  
>  	return (pmd_t) { ret };
>  }
> -#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> -static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> -{
> -	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index ae39c4b..0ff87ec 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -484,6 +484,16 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  	/*
>  	 * The barrier will stabilize the pmdval in a register or on
>  	 * the stack so that it will stop changing under the code.
> +	 *
> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
> +	 * (for example pointing to an hugepage that has never been
> +	 * mapped in the pmd). The below checks will only care about
> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
> +	 * exception of pmd_none(). So the important thing is that if
> +	 * the low part of the pmd is found null, the high part will
> +	 * be also null or the pmd_none() check below would be
> +	 * confused.
>  	 */
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	barrier();
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

  reply	other threads:[~2012-06-10  2:03 UTC|newest]

Thread overview: 106+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-07  4:14 [ 00/82] 3.4.2-stable review Greg KH
2012-06-07  4:03 ` [ 01/82] exofs: Fix CRASH on very early IO errors Greg KH
2012-06-07  4:03 ` [ 02/82] microblaze: Do not select GENERIC_GPIO by default Greg KH
2012-06-07  4:03 ` [ 03/82] SCSI: fix scsi_wait_scan Greg KH
2012-06-07  4:03 ` [ 04/82] SCSI: Fix dm-multipath starvation when scsi host is busy Greg KH
2012-06-07  4:03 ` [ 05/82] mm/fork: fix overflow in vma length when copying mmap on clone Greg KH
2012-06-07  4:03 ` [ 06/82] mm: fix NULL ptr deref when walking hugepages Greg KH
2012-06-07  4:03 ` [ 07/82] mm: consider all swapped back pages in used-once logic Greg KH
2012-06-07  4:03 ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Greg KH
2012-06-07 13:42   ` Josh Boyer
2012-06-07 14:42     ` Andrea Arcangeli
2012-06-07 17:46       ` Linus Torvalds
2012-06-07 19:04         ` Andrea Arcangeli
2012-06-07 21:00           ` Andrea Arcangeli
2012-06-07 21:00           ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE Andrea Arcangeli
2012-06-10  2:03             ` Konrad Rzeszutek Wilk [this message]
2012-06-11 10:34               ` [Xen-devel] [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\ Andrew Jones
2012-06-11 19:27                 ` Konrad Rzeszutek Wilk
2012-06-11 19:41                   ` Andrea Arcangeli
2012-06-07 21:02           ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Andrea Arcangeli
2012-06-08  8:04     ` Greg KH
2012-06-07 17:52   ` Konrad Rzeszutek Wilk
2012-06-07  4:03 ` [ 09/82] mm: fix faulty initialization in vmalloc_init() Greg KH
2012-06-07  4:03 ` [ 10/82] iwlwifi: update BT traffic load states correctly Greg KH
2012-06-07  4:03 ` [ 11/82] iwlwifi: do not use shadow registers by default Greg KH
2012-06-07  4:03 ` [ 12/82] cifs: Include backup intent search flags during searches {try #2) Greg KH
2012-06-07  4:03 ` [ 13/82] cifs: fix oops while traversing open file list (try #4) Greg KH
2012-06-07  4:03 ` [ 14/82] PARISC: fix boot failure on 32-bit systems caused by branch stubs placed before .text Greg KH
2012-06-07  4:03 ` [ 15/82] PARISC: fix TLB fault path on PA2.0 narrow systems Greg KH
2012-06-07  4:03 ` [ 16/82] solos-pci: Fix DMA support Greg KH
2012-06-07  4:03 ` [ 17/82] MIPS: BCM63XX: Add missing include for bcm63xx_gpio.h Greg KH
2012-06-07  4:03 ` [ 18/82] mac80211: fix ADDBA declined after suspend with wowlan Greg KH
2012-06-07  4:03 ` [ 19/82] ixp4xx: fix compilation by adding gpiolib support Greg KH
2012-06-07  4:03 ` [ 20/82] ath9k: fix a use-after-free-bug when ath_tx_setup_buffer() fails Greg KH
2012-06-07  4:03 ` [ 21/82] x86, amd, xen: Avoid NULL pointer paravirt references Greg KH
2012-06-07  4:03 ` [ 22/82] NFS: kmalloc() doesnt return an ERR_PTR() Greg KH
2012-06-07  4:03 ` [ 23/82] NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO Greg KH
2012-06-07  4:04 ` [ 24/82] hugetlb: fix resv_map leak in error path Greg KH
2012-06-07  4:04 ` [ 25/82] sunrpc: fix loss of task->tk_status after rpc_delay call in xprt_alloc_slot Greg KH
2012-06-07  4:04 ` [ 26/82] iommu/amd: Check for the right TLP prefix bit Greg KH
2012-06-07  4:04 ` [ 27/82] iommu/amd: Add workaround for event log erratum Greg KH
2012-06-07  4:04 ` [ 28/82] drm/radeon: fix XFX quirk Greg KH
2012-06-07  4:04 ` [ 29/82] drm/radeon: fix typo in trinity tiling setup Greg KH
2012-06-07  4:04 ` [ 30/82] drm/i915: properly handle interlaced bit for sdvo dtd conversion Greg KH
2012-06-07  4:04 ` [ 31/82] drm/i915: Adding TV Out Missing modes Greg KH
2012-06-07  4:04 ` [ 32/82] drm/i915: wait for a vblank to pass after tv detect Greg KH
2012-06-07  4:04 ` [ 33/82] drm/i915: no lvds quirk for HP t5740e Thin Client Greg KH
2012-06-07  4:04 ` [ 34/82] kbuild: install kernel-page-flags.h Greg KH
2012-06-07  4:04 ` [ 35/82] mm: fix vma_resv_map() NULL pointer Greg KH
2012-06-07  4:04 ` [ 36/82] ALSA: usb-audio: fix rate_list memory leak Greg KH
2012-06-07  4:04 ` [ 37/82] slub: fix a memory leak in get_partial_node() Greg KH
2012-06-07  4:04 ` [ 38/82] vfs: umount_tree() might be called on subtree that had never made it Greg KH
2012-06-07  4:04 ` [ 39/82] vfs: increment iversion when a file is truncated Greg KH
2012-06-07  4:04 ` [ 40/82] fec_mpc52xx: fix timestamp filtering Greg KH
2012-06-07  4:04 ` [ 41/82] x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32 Greg KH
2012-06-07  4:04 ` [ 42/82] x86: Reset the debug_stack update counter Greg KH
2012-06-07  4:04 ` [ 43/82] mtd: nand: fix scan_read_raw_oob Greg KH
2012-06-07  4:04 ` [ 44/82] mtd: of_parts: fix breakage in Kconfig Greg KH
2012-06-07  4:04 ` [ 45/82] mtd: block2mtd: fix recursive call of mtd_writev Greg KH
2012-06-07  4:04 ` [ 46/82] mtd: mxc_nand: move ecc strengh setup before nand_scan_tail Greg KH
2012-06-07  4:04 ` [ 47/82] drm/radeon: fix regression in UMS CS ioctl Greg KH
2012-06-07  4:04 ` [ 48/82] drm/radeon: fix bank information in tiling config Greg KH
2012-06-07  4:04 ` [ 49/82] drm/radeon: properly program gart on rv740, juniper, cypress, barts, hemlock Greg KH
2012-06-07  4:04 ` [ 50/82] drm/radeon: fix HD6790, HD6570 backend programming Greg KH
2012-06-07  4:04 ` [ 51/82] drm/ttm: Fix spinlock imbalance Greg KH
2012-06-07  4:04 ` [ 52/82] drm/vmwgfx: Fix nasty write past alloced memory area Greg KH
2012-06-07  4:04 ` [ 53/82] asix: allow full size 8021Q frames to be received Greg KH
2012-06-08  2:27   ` Ben Hutchings
2012-06-08  3:54     ` David Miller
2012-06-07  4:04 ` [ 54/82] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow Greg KH
2012-06-07  4:04 ` [ 55/82] ipv6: fix incorrect ipsec fragment Greg KH
2012-06-07  4:04 ` [ 56/82] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case Greg KH
2012-06-07  4:04 ` [ 57/82] skb: avoid unnecessary reallocations in __skb_cow Greg KH
2012-06-07  4:04 ` [ 58/82] xfrm: take net hdr len into account for esp payload size calculation Greg KH
2012-06-07  4:04 ` [ 59/82] ext4: fix potential NULL dereference in ext4_free_inodes_counts() Greg KH
2012-06-07  4:04 ` [ 60/82] ext4: force ro mount if ext4_setup_super() fails Greg KH
2012-06-07  4:04 ` [ 61/82] ext4: fix potential integer overflow in alloc_flex_gd() Greg KH
2012-06-07  4:04 ` [ 62/82] ext4: disallow hard-linked directory in ext4_lookup Greg KH
2012-06-07  4:04 ` [ 63/82] ext4: add missing save_error_info() to ext4_error() Greg KH
2012-06-07  4:04 ` [ 64/82] ext4: dont trash state flags in EXT4_IOC_SETFLAGS Greg KH
2012-06-08  3:03   ` Ben Hutchings
2012-06-08  3:11     ` Ted Ts'o
2012-06-08  3:21       ` Ben Hutchings
2012-06-08 20:05         ` Ted Ts'o
2012-06-08 23:01           ` Ben Hutchings
2012-06-09  2:30             ` Ted Ts'o
2012-06-09 12:56               ` Ben Hutchings
2012-06-09 15:23           ` Greg KH
2012-06-07  4:04 ` [ 65/82] ext4: add ext4_mb_unload_buddy in the error path Greg KH
2012-06-07  4:04 ` [ 66/82] ext4: remove mb_groups before tearing down the buddy_cache Greg KH
2012-06-07  4:04 ` [ 67/82] radix-tree: fix contiguous iterator Greg KH
2012-06-07  4:04 ` [ 68/82] drm/radeon/audio: dont hardcode CRTC id Greg KH
2012-06-07  4:04 ` [ 69/82] drm/radeon: fix vm deadlocks on cayman Greg KH
2012-06-07  4:04 ` [ 70/82] drm/radeon/kms: add new Trinity PCI ids Greg KH
2012-06-07  4:04 ` [ 71/82] drm/radeon/kms: add new Palm, Sumo " Greg KH
2012-06-07  4:04 ` [ 72/82] drm/radeon/kms: add new BTC " Greg KH
2012-06-07  4:04 ` [ 73/82] drm/radeon/kms: add new SI " Greg KH
2012-06-07  4:04 ` [ 74/82] iommu/amd: Cache pdev pointer to root-bridge Greg KH
2012-06-07  4:04 ` [ 75/82] iommu/amd: Fix deadlock in ppr-handling error path Greg KH
2012-06-07  4:04 ` [ 76/82] ACPI battery: only refresh the sysfs files when pertinent information changes Greg KH
2012-06-07  4:04 ` [ 77/82] vfs: Fix /proc/<tid>/fdinfo/<fd> file handling Greg KH
2012-06-07  4:04 ` [ 78/82] md: raid1/raid10: fix problem with merge_bvec_fn Greg KH
2012-06-07  4:04 ` [ 79/82] wl1251: fix oops on early interrupt Greg KH
2012-06-07  4:04 ` [ 80/82] drm/i915: always use RPNSWREQ for turbo change requests Greg KH
2012-06-07  4:04 ` [ 81/82] drm/i915/dp: Flush any outstanding work to turn the VDD off Greg KH
2012-06-07  4:04 ` [ 82/82] drm/i915: enable vdd when switching off the eDP panel Greg KH

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120610020331.GA26100@localhost.localdomain \
    --to=konrad@darnok.org \
    --cc=676360@bugs.debian.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=drjones@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hughd@google.com \
    --cc=jbeulich@suse.com \
    --cc=jrnieder@gmail.com \
    --cc=konrad.wilk@oracle.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lwoodman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=pmatouse@redhat.com \
    --cc=riel@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=uobergfe@redhat.com \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).