[PATCH] x86/mm: Clean up the pmd_read_atomic() comments

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ingo Molnar <mingo@kernel.org>
To: Wei Yang <richardw.yang@linux.intel.com>
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andy Lutomirski <luto@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Rik van Riel <riel@surriel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments
Date: Wed, 25 Sep 2019 08:55:14 +0200	[thread overview]
Message-ID: <20190925065514.GA21740@gmail.com> (raw)
In-Reply-To: <20190925015956.GA20295@richard>


* Wei Yang <richardw.yang@linux.intel.com> wrote:

> To be honest, I have a question on how this works.
> 
> As the comment says, we need to call pmd_read_atomic before using
> pte_offset_map_lock to avoid data corruption.

There's only a risk of data corruption if mmap_sem is held for reading. 
If it's held for writing then the pagetable contents should be stable.

> For example, in function swapin_walk_pmd_entry:
> 
>     pmd_none_or_trans_huge_or_clear_bad(pmd)
>         pmd_read_atomic(pmd)                      ---   1
>     pte_offset_map_lock(mm, pmd, ...)             ---   2
> 
> At point 1, we are assured the content is intact. While in point 2, we would
> read pmd again to calculate the pte address. How we ensure this time the
> content is intact? Because pmd_none_or_trans_huge_or_clear_bad() ensures the
> pte is stable, so that the content won't be changed?

Indeed pte_offset_map_lock() will take a non-atomic *pmd value in 
pte_lockptr() before taking the pte spinlock.

I believe the rule here is that if pmd_none_or_trans_huge_or_clear_bad() 
finds the pmd 'stable' while holding the mmap_sem read-locked, then the 
pmd cannot change while we are continuously holding the mmap_sem.

Hence the followup pte_offset_map_lock() and other iterators can look at 
the value of the pmd without locking. (Individual pte entries still need 
the pte-lock, because they might be faulted-in in parallel.)

So the pmd use pattern in swapin_walk_pmd_entry() should be safe.

I'm not 100% sure though - so I've added a few more Cc:s ...

I've also cleaned up the pmd_read_atomic() some more to make it more 
readable - see the patch below.

Thanks,

	Ingo

==================>
From: Ingo Molnar <mingo@kernel.org>
Date: Wed, 25 Sep 2019 08:38:57 +0200
Subject: [PATCH] x86/mm: Clean up the pmd_read_atomic() comments

Fix spelling, consistent parenthesis and grammar - and also clarify
the language where needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Link: https://lkml.kernel.org/r/20190925014453.20236-1-richardw.yang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable-3level.h | 44 ++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 1796462ff143..5afb5e0fe903 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -36,39 +36,41 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 
 #define pmd_read_atomic pmd_read_atomic
 /*
- * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
- * a "*pmdp" dereference done by gcc. Problem is, in certain places
- * where pte_offset_map_lock is called, concurrent page faults are
+ * pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with
+ * a "*pmdp" dereference done by GCC. Problem is, in certain places
+ * where pte_offset_map_lock() is called, concurrent page faults are
  * allowed, if the mmap_sem is hold for reading. An example is mincore
  * vs page faults vs MADV_DONTNEED. On the page fault side
- * pmd_populate rightfully does a set_64bit, but if we're reading the
+ * pmd_populate() rightfully does a set_64bit(), but if we're reading the
  * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
- * because gcc will not read the 64bit of the pmd atomically. To fix
- * this all places running pte_offset_map_lock() while holding the
+ * because GCC will not read the 64-bit value of the pmd atomically.
+ *
+ * To fix this all places running pte_offset_map_lock() while holding the
  * mmap_sem in read mode, shall read the pmdp pointer using this
- * function to know if the pmd is null nor not, and in turn to know if
+ * function to know if the pmd is null or not, and in turn to know if
  * they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd
  * operations.
  *
- * Without THP if the mmap_sem is hold for reading, the pmd can only
- * transition from null to not null while pmd_read_atomic runs. So
+ * Without THP if the mmap_sem is held for reading, the pmd can only
+ * transition from null to not null while pmd_read_atomic() runs. So
  * we can always return atomic pmd values with this function.
  *
- * With THP if the mmap_sem is hold for reading, the pmd can become
+ * With THP if the mmap_sem is held for reading, the pmd can become
  * trans_huge or none or point to a pte (and in turn become "stable")
- * at any time under pmd_read_atomic. We could read it really
- * atomically here with a atomic64_read for the THP enabled case (and
+ * at any time under pmd_read_atomic(). We could read it truly
+ * atomically here with an atomic64_read() for the THP enabled case (and
  * it would be a whole lot simpler), but to avoid using cmpxchg8b we
  * only return an atomic pmdval if the low part of the pmdval is later
- * found stable (i.e. pointing to a pte). And we're returning a none
- * pmdval if the low part of the pmd is none. In some cases the high
- * and low part of the pmdval returned may not be consistent if THP is
- * enabled (the low part may point to previously mapped hugepage,
- * while the high part may point to a more recently mapped hugepage),
- * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
- * of the pmd to be read atomically to decide if the pmd is unstable
- * or not, with the only exception of when the low part of the pmd is
- * zero in which case we return a none pmd.
+ * found to be stable (i.e. pointing to a pte). We are also returning a
+ * 'none' (zero) pmdval if the low part of the pmd is zero.
+ *
+ * In some cases the high and low part of the pmdval returned may not be
+ * consistent if THP is enabled (the low part may point to previously
+ * mapped hugepage, while the high part may point to a more recently
+ * mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only
+ * needs the low part of the pmd to be read atomically to decide if the
+ * pmd is unstable or not, with the only exception when the low part
+ * of the pmd is zero, in which case we return a 'none' pmd.
  */
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {

next prev parent reply	other threads:[~2019-09-25  6:55 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-25  1:44 [PATCH] x86/mm: fix function name typo in pmd_read_atomic() comment Wei Yang
2019-09-25  1:59 ` Wei Yang
2019-09-25  6:55   ` Ingo Molnar [this message]
2019-09-25  7:28     ` [PATCH] x86/mm: Clean up the pmd_read_atomic() comments Wei Yang
2019-09-27  8:10 ` [tip: x86/mm] x86/mm: Fix function name typo in pmd_read_atomic() comment tip-bot2 for Wei Yang

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1796462ff14 dfblob:5afb5e0fe90 )
 OR (
bs:"x86/mm: Clean up the pmd_read_atomic() comments" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190925065514.GA21740@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=richardw.yang@linux.intel.com \
    --cc=riel@surriel.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.