xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* Re: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
       [not found] <20120607190414.GF21339@redhat.com>
@ 2012-06-07 21:00 ` Andrea Arcangeli
  2012-06-07 21:00 ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE Andrea Arcangeli
  1 sibling, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2012-06-07 21:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Greg KH
  Cc: 676360, xen-devel, Jonathan Nieder, linux-kernel,
	linux-mm@kvack.org Konrad Rzeszutek Wilk, stable, alan,
	Ulrich Obergfell, Mel Gorman, Hugh Dickins, Larry Woodman,
	Petr Matousek, Rik van Riel, Jan Beulich, KOSAKI Motohiro

Hi,

this should avoid the cmpxchg8b (to make Xen happy) but without
reintroducing the race condition. It's actually going to be faster
too, but it's conceptually more complicated as the pmd high/low may be
inconsistent at times, but at those times we're going to declare the
pmd unstable and ignore it anyway so it's ok.

NOTE: in theory I could also drop the high part when THP=y thanks to
the barrier() in the caller (and the barrier is needed for the generic
version anyway):

static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
{
	pmdval_t ret;
	u32 *tmp = (u32 *)pmdp;

	ret = (pmdval_t) (*tmp);
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
	if (ret) {
		/*
		 * If the low part is null, we must not read the high part
		 * or we can end up with a partial pmd.
		 */
		smp_rmb();
		ret |= ((pmdval_t)*(tmp + 1)) << 32;
	}
+#endif

	return (pmd_t) { ret };
}

But it's not worth the extra complexity. It looks cleaner if we deal
with "good" pmds if they're later found pointing to a pte (even if we
discard them and force pte_offset to re-read the *pmd).

Andrea Arcangeli (1):
  thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE

 arch/x86/include/asm/pgtable-3level.h |   30 +++++++++++++++++-------------
 include/asm-generic/pgtable.h         |   10 ++++++++++
 2 files changed, 27 insertions(+), 13 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE
       [not found] <20120607190414.GF21339@redhat.com>
  2012-06-07 21:00 ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Andrea Arcangeli
@ 2012-06-07 21:00 ` Andrea Arcangeli
  2012-06-10  2:03   ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\ Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 6+ messages in thread
From: Andrea Arcangeli @ 2012-06-07 21:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Greg KH
  Cc: 676360, xen-devel, Jonathan Nieder, linux-kernel,
	linux-mm@kvack.org Konrad Rzeszutek Wilk, stable, alan,
	Ulrich Obergfell, Mel Gorman, Hugh Dickins, Larry Woodman,
	Petr Matousek, Rik van Riel, Jan Beulich, KOSAKI Motohiro

In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding
the mmap_sem for reading, cmpxchg8b cannot be used to read pmd
contents under Xen.

So instead of dealing only with "consistent" pmdvals in
pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals
where the low 32bit and high 32bit could be inconsistent (to avoid
having to use cmpxchg8b).

The only guarantee we get from pmd_read_atomic is that if the low part
of the pmd was found null, the high part will be null too (so the pmd
will be considered unstable). And if the low part of the pmd is found
"stable" later, then it means the whole pmd was read atomically
(because after a pmd is stable, neither MADV_DONTNEED nor page faults
can alter it anymore, and we read the high part after the low part).

In the 32bit PAE x86 case, it is enough to read the low part of the
pmdval atomically to declare the pmd as "stable" and that's true for
THP and no THP, furthermore in the THP case we also have a barrier()
that will prevent any inconsistent pmdvals to be cached by a later
re-read of the *pmd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable-3level.h |   30 +++++++++++++++++-------------
 include/asm-generic/pgtable.h         |   10 ++++++++++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 43876f1..cb00ccc 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -47,16 +47,26 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
  * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
  * operations.
  *
- * Without THP if the mmap_sem is hold for reading, the
- * pmd can only transition from null to not null while pmd_read_atomic runs.
- * So there's no need of literally reading it atomically.
+ * Without THP if the mmap_sem is hold for reading, the pmd can only
+ * transition from null to not null while pmd_read_atomic runs. So
+ * we can always return atomic pmd values with this function.
  *
  * With THP if the mmap_sem is hold for reading, the pmd can become
- * THP or null or point to a pte (and in turn become "stable") at any
- * time under pmd_read_atomic, so it's mandatory to read it atomically
- * with cmpxchg8b.
+ * trans_huge or none or point to a pte (and in turn become "stable")
+ * at any time under pmd_read_atomic. We could read it really
+ * atomically here with a atomic64_read for the THP enabled case (and
+ * it would be a whole lot simpler), but to avoid using cmpxchg8b we
+ * only return an atomic pmdval if the low part of the pmdval is later
+ * found stable (i.e. pointing to a pte). And we're returning a none
+ * pmdval if the low part of the pmd is none. In some cases the high
+ * and low part of the pmdval returned may not be consistent if THP is
+ * enabled (the low part may point to previously mapped hugepage,
+ * while the high part may point to a more recently mapped hugepage),
+ * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
+ * of the pmd to be read atomically to decide if the pmd is unstable
+ * or not, with the only exception of when the low part of the pmd is
+ * zero in which case we return a none pmd.
  */
-#ifndef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
@@ -74,12 +84,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 
 	return (pmd_t) { ret };
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ae39c4b..0ff87ec 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -484,6 +484,16 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.
+	 *
+	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
+	 * pmd_read_atomic is allowed to return a not atomic pmdval
+	 * (for example pointing to an hugepage that has never been
+	 * mapped in the pmd). The below checks will only care about
+	 * the low part of the pmd with 32bit PAE x86 anyway, with the
+	 * exception of pmd_none(). So the important thing is that if
+	 * the low part of the pmd is found null, the high part will
+	 * be also null or the pmd_none() check below would be
+	 * confused.
 	 */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	barrier();

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\
  2012-06-07 21:00 ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE Andrea Arcangeli
@ 2012-06-10  2:03   ` Konrad Rzeszutek Wilk
  2012-06-11 10:34     ` [Xen-devel] " Andrew Jones
  0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-10  2:03 UTC (permalink / raw)
  To: Andrea Arcangeli, drjones
  Cc: Linus Torvalds, Andrew Morton, Greg KH, 676360, xen-devel,
	Jonathan Nieder, linux-kernel,
	linux-mm@kvack.org Konrad Rzeszutek Wilk, stable, alan,
	Ulrich Obergfell, Mel Gorman, Hugh Dickins, Larry Woodman,
	Petr Matousek, Rik van Riel, Jan Beulich, KOSAKI Motohiro

On Thu, Jun 07, 2012 at 11:00:33PM +0200, Andrea Arcangeli wrote:
> In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding
> the mmap_sem for reading, cmpxchg8b cannot be used to read pmd
> contents under Xen.
> 
> So instead of dealing only with "consistent" pmdvals in
> pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
> simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals
> where the low 32bit and high 32bit could be inconsistent (to avoid
> having to use cmpxchg8b).

<nods>
> 
> The only guarantee we get from pmd_read_atomic is that if the low part
> of the pmd was found null, the high part will be null too (so the pmd
> will be considered unstable). And if the low part of the pmd is found
> "stable" later, then it means the whole pmd was read atomically
> (because after a pmd is stable, neither MADV_DONTNEED nor page faults
> can alter it anymore, and we read the high part after the low part).
> 
> In the 32bit PAE x86 case, it is enough to read the low part of the
> pmdval atomically to declare the pmd as "stable" and that's true for
> THP and no THP, furthermore in the THP case we also have a barrier()
> that will prevent any inconsistent pmdvals to be cached by a later
> re-read of the *pmd.

Nice. Andrew, any chane you could test this patch on the affected
Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?)
hypervisor or is it really only on Linode and Amazon EC2?

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable-3level.h |   30 +++++++++++++++++-------------
>  include/asm-generic/pgtable.h         |   10 ++++++++++
>  2 files changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index 43876f1..cb00ccc 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -47,16 +47,26 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
>   * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
>   * operations.
>   *
> - * Without THP if the mmap_sem is hold for reading, the
> - * pmd can only transition from null to not null while pmd_read_atomic runs.
> - * So there's no need of literally reading it atomically.
> + * Without THP if the mmap_sem is hold for reading, the pmd can only
> + * transition from null to not null while pmd_read_atomic runs. So
> + * we can always return atomic pmd values with this function.
>   *
>   * With THP if the mmap_sem is hold for reading, the pmd can become
> - * THP or null or point to a pte (and in turn become "stable") at any
> - * time under pmd_read_atomic, so it's mandatory to read it atomically
> - * with cmpxchg8b.
> + * trans_huge or none or point to a pte (and in turn become "stable")
> + * at any time under pmd_read_atomic. We could read it really
> + * atomically here with a atomic64_read for the THP enabled case (and
> + * it would be a whole lot simpler), but to avoid using cmpxchg8b we
> + * only return an atomic pmdval if the low part of the pmdval is later
> + * found stable (i.e. pointing to a pte). And we're returning a none
> + * pmdval if the low part of the pmd is none. In some cases the high
> + * and low part of the pmdval returned may not be consistent if THP is
> + * enabled (the low part may point to previously mapped hugepage,
> + * while the high part may point to a more recently mapped hugepage),
> + * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
> + * of the pmd to be read atomically to decide if the pmd is unstable
> + * or not, with the only exception of when the low part of the pmd is
> + * zero in which case we return a none pmd.
>   */
> -#ifndef CONFIG_TRANSPARENT_HUGEPAGE
>  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  {
>  	pmdval_t ret;
> @@ -74,12 +84,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  
>  	return (pmd_t) { ret };
>  }
> -#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> -static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> -{
> -	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index ae39c4b..0ff87ec 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -484,6 +484,16 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  	/*
>  	 * The barrier will stabilize the pmdval in a register or on
>  	 * the stack so that it will stop changing under the code.
> +	 *
> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
> +	 * (for example pointing to an hugepage that has never been
> +	 * mapped in the pmd). The below checks will only care about
> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
> +	 * exception of pmd_none(). So the important thing is that if
> +	 * the low part of the pmd is found null, the high part will
> +	 * be also null or the pmd_none() check below would be
> +	 * confused.
>  	 */
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	barrier();
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Xen-devel] [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\
  2012-06-10  2:03   ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\ Konrad Rzeszutek Wilk
@ 2012-06-11 10:34     ` Andrew Jones
  2012-06-11 19:27       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Jones @ 2012-06-11 10:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, Petr Matousek, Jan Beulich,
	linux-mm@kvack.org Konrad Rzeszutek Wilk, Greg KH, Hugh Dickins,
	linux-kernel, stable, Ulrich Obergfell, Jonathan Nieder,
	Mel Gorman, 676360, KOSAKI Motohiro, Andrew Morton,
	Linus Torvalds, Larry Woodman, alan, Andrea Arcangeli



----- Original Message -----
> On Thu, Jun 07, 2012 at 11:00:33PM +0200, Andrea Arcangeli wrote:
> > In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while
> > holding
> > the mmap_sem for reading, cmpxchg8b cannot be used to read pmd
> > contents under Xen.
> > 
> > So instead of dealing only with "consistent" pmdvals in
> > pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
> > simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with
> > pmdvals
> > where the low 32bit and high 32bit could be inconsistent (to avoid
> > having to use cmpxchg8b).
> 
> <nods>
> > 
> > The only guarantee we get from pmd_read_atomic is that if the low
> > part
> > of the pmd was found null, the high part will be null too (so the
> > pmd
> > will be considered unstable). And if the low part of the pmd is
> > found
> > "stable" later, then it means the whole pmd was read atomically
> > (because after a pmd is stable, neither MADV_DONTNEED nor page
> > faults
> > can alter it anymore, and we read the high part after the low
> > part).
> > 
> > In the 32bit PAE x86 case, it is enough to read the low part of the
> > pmdval atomically to declare the pmd as "stable" and that's true
> > for
> > THP and no THP, furthermore in the THP case we also have a
> > barrier()
> > that will prevent any inconsistent pmdvals to be cached by a later
> > re-read of the *pmd.
> 
> Nice. Andrew, any chane you could test this patch on the affected
> Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?)
> hypervisor or is it really only on Linode and Amazon EC2?
> 

Originally, I was able to reproduce the issue easily with a RHEL5
host. Now, with this patch it's fixed.

Drew

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Xen-devel] [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\
  2012-06-11 10:34     ` [Xen-devel] " Andrew Jones
@ 2012-06-11 19:27       ` Konrad Rzeszutek Wilk
  2012-06-11 19:41         ` Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-11 19:27 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Konrad Rzeszutek Wilk, Andrea Arcangeli, xen-devel, Petr Matousek,
	676360, Greg KH, Hugh Dickins, linux-kernel, stable,
	Ulrich Obergfell, Jonathan Nieder, Mel Gorman, Jan Beulich,
	KOSAKI Motohiro, Andrew Morton, Linus Torvalds, Larry Woodman,
	alan

> > Nice. Andrew, any chane you could test this patch on the affected
> > Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?)
> > hypervisor or is it really only on Linode and Amazon EC2?
> > 
> 
> Originally, I was able to reproduce the issue easily with a RHEL5
> host. Now, with this patch it's fixed.

OK, so Tested-by: Andrew Jones..
and from my perspective it looks good - so Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Andrea, any chance you can respin this patch and send it to Linus for 3.5 please?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Xen-devel] [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\
  2012-06-11 19:27       ` Konrad Rzeszutek Wilk
@ 2012-06-11 19:41         ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2012-06-11 19:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Jones, Konrad Rzeszutek Wilk, xen-devel, Petr Matousek,
	676360, Greg KH, Hugh Dickins, linux-kernel, stable,
	Ulrich Obergfell, Jonathan Nieder, Mel Gorman, Jan Beulich,
	KOSAKI Motohiro, Andrew Morton, Linus Torvalds, Larry Woodman,
	alan

Hi,

On Mon, Jun 11, 2012 at 03:27:38PM -0400, Konrad Rzeszutek Wilk wrote:
> > > Nice. Andrew, any chane you could test this patch on the affected
> > > Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?)
> > > hypervisor or is it really only on Linode and Amazon EC2?
> > > 
> > 
> > Originally, I was able to reproduce the issue easily with a RHEL5
> > host. Now, with this patch it's fixed.
> 
> OK, so Tested-by: Andrew Jones..
> and from my perspective it looks good - so Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thanks for testing and reviews.

> Andrea, any chance you can respin this patch and send it to Linus for 3.5 please?

Andrew merged it in -mm last Friday, so I would expect it to go
upstream soon through the -mm flow (I assume everyone has been
rightfully waiting a bit of time for testing and reviews to be sure).

Andrea

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-06-11 19:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20120607190414.GF21339@redhat.com>
2012-06-07 21:00 ` [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition Andrea Arcangeli
2012-06-07 21:00 ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE Andrea Arcangeli
2012-06-10  2:03   ` [PATCH] thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE\ Konrad Rzeszutek Wilk
2012-06-11 10:34     ` [Xen-devel] " Andrew Jones
2012-06-11 19:27       ` Konrad Rzeszutek Wilk
2012-06-11 19:41         ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).