* [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 3:45 ` [PATCH v2 01/14 update] " Qi Zheng
` (2 more replies)
2024-08-22 7:13 ` [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
` (12 subsequent siblings)
13 siblings, 3 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. This can help the caller recheck *pmd once the PTL
is taken. In some cases, either holding the mmap_lock for write, or a
pte_same() check on contents, is also enough to ensure that the pmd entry
is stable. But in order to prevent the interface from being abused, we
choose to pass in a dummy local variable instead of NULL.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
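[Illustrative sketch, not part of the patch: how a hypothetical caller might use the two new helpers. The walker functions and the pte manipulation below are made up; only the pte_offset_map_{ro|rw}_nolock() calls, the locking, and the pmd_same() recheck reflect the intended usage described above.]

#include <linux/mm.h>
#include <linux/pgtable.h>

/* Case 1): only read-only access to the PTE page under the PTL. */
static bool example_walk_ro(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte, entry;

	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;
	spin_lock(ptl);
	/* No recheck of *pmd: RCU keeps this PTE page from being freed. */
	entry = ptep_get(pte);
	spin_unlock(ptl);
	pte_unmap(pte);
	return pte_present(entry);
}

/* Case 2): the pte (or pmd) entry may be modified under the PTL. */
static int example_walk_rw(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t pmdval;	/* dummy local variable is required, NULL is not accepted */
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return -EAGAIN;
	spin_lock(ptl);
	/* The PTE page may have been replaced before the PTL was taken. */
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		spin_unlock(ptl);
		pte_unmap(pte);
		return -EAGAIN;
	}
	/* Made-up modification, just to show where a write would go. */
	if (pte_present(ptep_get(pte)))
		set_pte_at(mm, addr, pte, pte_mkdirty(ptep_get(pte)));
	spin_unlock(ptl);
	pte_unmap(pte);
	return 0;
}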
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Documentation/mm/split_page_table_lock.rst | 7 ++++
include/linux/mm.h | 5 +++
mm/pgtable-generic.c | 43 ++++++++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken) and the value of its pmd entry, or returns NULL
+ if no PTE table;
- pte_offset_map()
maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
- pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da29b066495d6..a00cb35ce065f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp);
#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..9a1666574c959 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ BUG_ON(!pmdvalp);
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ *pmdvalp = pmdval;
+ return pte;
+}
+
/*
* pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
* __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
* recheck *pmd once the lock is taken; in practice, no callsite needs that -
* either the mmap_lock for write, or pte_same() check on contents, is enough.
*
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * For the read-only case, the caller does not need to recheck *pmd after the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not be freed.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pmdval. For cases where the pte or pmd entries may be modified, that is, the
+ * may-write case, this can help the caller recheck *pmd once the lock is taken.
+ * In some cases, either holding the mmap_lock for write, or a pte_same() check
+ * on contents, is also enough to ensure that the pmd entry is stable.
+ *
* Note that free_pgtables(), used after unmapping detached vmas, or when
* exiting the whole mm, does not take page table lock before freeing a page
* table, and may not use RCU at all: "outsiders" like khugepaged should avoid
--
2.20.1
* [PATCH v2 01/14 update] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-22 7:13 ` [PATCH v2 01/14] mm: pgtable: " Qi Zheng
@ 2024-08-26 3:45 ` Qi Zheng
2024-08-26 15:21 ` [PATCH v2 01/14] " David Hildenbrand
2024-08-28 9:48 ` Muchun Song
2 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-26 3:45 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, Qi Zheng
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. This can help the caller recheck *pmd once the PTL
is taken. In some cases, either holding the mmap_lock for write, or a
pte_same() check on contents, is also enough to ensure that the pmd entry
is stable. But in order to prevent the interface from being abused, we
choose to pass in a dummy local variable instead of NULL.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Change to use VM_WARN_ON_ONCE() instead of BUG_ON(). (David Hildenbrand)
Documentation/mm/split_page_table_lock.rst | 7 ++++
include/linux/mm.h | 5 +++
mm/pgtable-generic.c | 43 ++++++++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken) and the value of its pmd entry, or returns NULL
+ if no PTE table;
- pte_offset_map()
maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
- pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da29b066495d6..a00cb35ce065f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp);
#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..7abcea4451a71 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ VM_WARN_ON_ONCE(!pmdvalp);
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ *pmdvalp = pmdval;
+ return pte;
+}
+
/*
* pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
* __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
* recheck *pmd once the lock is taken; in practice, no callsite needs that -
* either the mmap_lock for write, or pte_same() check on contents, is enough.
*
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * For the read-only case, the caller does not need to recheck *pmd after the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not be freed.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pmdval. For cases where the pte or pmd entries may be modified, that is, the
+ * may-write case, this can help the caller recheck *pmd once the lock is taken.
+ * In some cases, either holding the mmap_lock for write, or a pte_same() check
+ * on contents, is also enough to ensure that the pmd entry is stable.
+ *
* Note that free_pgtables(), used after unmapping detached vmas, or when
* exiting the whole mm, does not take page table lock before freeing a page
* table, and may not use RCU at all: "outsiders" like khugepaged should avoid
--
2.20.1
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-22 7:13 ` [PATCH v2 01/14] mm: pgtable: " Qi Zheng
2024-08-26 3:45 ` [PATCH v2 01/14 update] " Qi Zheng
@ 2024-08-26 15:21 ` David Hildenbrand
2024-08-27 4:33 ` Qi Zheng
2024-08-28 9:48 ` Muchun Song
2 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:21 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
>
> 1) After acquiring PTL, only read-only operations are performed on the PTE
> page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
> that the PTE page will not be freed, and there is no need to worry
> about whether the pmd entry is modified.
There is also the usage where we don't grab the PTL at all, and only do
a racy (read-only) lookup.
>
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
> time, we need to ensure that the pmd entry has not been modified
> concurrently.
>
> To more clearly distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. This can help the caller recheck *pmd once the PTL
> is taken. In some cases, that is, either the mmap_lock for write, or
> pte_same() check on contents, is also enough to ensure that the pmd entry
> is stable. But in order to prevent the interface from being abused, we
> choose to pass in a dummy local variable instead of NULL.
>
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Documentation/mm/split_page_table_lock.rst | 7 ++++
> include/linux/mm.h | 5 +++
> mm/pgtable-generic.c | 43 ++++++++++++++++++++++
> 3 files changed, 55 insertions(+)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
> - pte_offset_map_nolock()
> maps PTE, returns pointer to PTE with pointer to its PTE table
> lock (not taken), or returns NULL if no PTE table;
What will happen to pte_offset_map_nolock() after this series? Does it
still exist or will it become an internal helper?
> + - pte_offset_map_ro_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken) and the value of its pmd entry, or returns NULL
> + if no PTE table;
[...]
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, pmd_t *pmdvalp,
> + spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + BUG_ON(!pmdvalp);
As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early
testing and should catch these kinds of things.
If someone thinks not requiring a non-NULL pointer is better, please
speak up, I'm not married to that idea :)
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + *pmdvalp = pmdval;
> + return pte;
> +}
> +
> /*
> * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
> * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> * recheck *pmd once the lock is taken; in practice, no callsite needs that -
> * either the mmap_lock for write, or pte_same() check on contents, is enough.
> *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> + * pte_offset_map(); but when successful, it also outputs a pointer to the
> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
> + * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
> + * provides the correct spinlock pointer for the page table that it returns.
> + * For readonly case, the caller does not need to recheck *pmd after the lock is
> + * taken, because the RCU lock will ensure that the PTE page will not be freed.
> + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
> + * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
> + * case, this can help the caller recheck *pmd once the lock is taken. In some
> + * cases, that is, either the mmap_lock for write, or pte_same() check on
> + * contents, is also enough to ensure that the pmd entry is stable.
> + *
> * Note that free_pgtables(), used after unmapping detached vmas, or when
> * exiting the whole mm, does not take page table lock before freeing a page
> * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
In general, to me this is a step in the right direction. Likely the
documentation could be further clarified in some aspects:
Like that the use of pte_offset_map_ro_nolock() does not allow one to easily
identify whether the page table was replaced in the meantime. Even after
grabbing the PTL, we might be looking either at a page table that is
still mapped or one that was unmapped and is about to get freed. But for
R/O access this is usually sufficient AFAIUK.
Or that "RO" / "RW" expresses the intended semantics, not that the
*kmap* will be RO/RW protected.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-26 15:21 ` [PATCH v2 01/14] " David Hildenbrand
@ 2024-08-27 4:33 ` Qi Zheng
2024-08-28 10:48 ` David Hildenbrand
0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-08-27 4:33 UTC (permalink / raw)
To: David Hildenbrand
Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
Hi David,
On 2024/8/26 23:21, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>> PTE
>> page. In this case, the RCU lock in pte_offset_map_nolock() will
>> ensure
>> that the PTE page will not be freed, and there is no need to worry
>> about whether the pmd entry is modified.
>
> There is also the usage where we don't grab the PTL at all, and only do
> a racy (read-only) lookup.
IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
in this case.
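(As a minimal illustration of that fully lockless pattern, with a hypothetical helper name: something like the following only needs pte_offset_map(), since the PTL is never taken and the RCU read lock taken internally keeps the PTE page alive for the duration of the mapping.)

static bool example_pte_mapped(pmd_t *pmd, unsigned long addr)
{
	pte_t *pte, entry;

	pte = pte_offset_map(pmd, addr);
	if (!pte)
		return false;
	entry = ptep_get(pte);	/* racy: may change under us, read-only use only */
	pte_unmap(pte);
	return !pte_none(entry);
}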
>
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>> time, we need to ensure that the pmd entry has not been modified
>> concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. This can help the caller recheck *pmd once the
>> PTL
>> is taken. In some cases, that is, either the mmap_lock for write, or
>> pte_same() check on contents, is also enough to ensure that the pmd entry
>> is stable. But in order to prevent the interface from being abused, we
>> choose to pass in a dummy local variable instead of NULL.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Documentation/mm/split_page_table_lock.rst | 7 ++++
>> include/linux/mm.h | 5 +++
>> mm/pgtable-generic.c | 43 ++++++++++++++++++++++
>> 3 files changed, 55 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst
>> b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other
>> accessor functions:
>> - pte_offset_map_nolock()
>> maps PTE, returns pointer to PTE with pointer to its PTE table
>> lock (not taken), or returns NULL if no PTE table;
>
> What will happen to pte_offset_map_nolock() after this series? Does it
> still exist or will it become an internal helper?
I choose to remove it completely in [PATCH v2 13/14].
>
>> + - pte_offset_map_ro_nolock()
>> + maps PTE, returns pointer to PTE with pointer to its PTE table
>> + lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> + maps PTE, returns pointer to PTE with pointer to its PTE table
>> + lock (not taken) and the value of its pmd entry, or returns NULL
>> + if no PTE table;
>
> [...]
>
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> + unsigned long addr, pmd_t *pmdvalp,
>> + spinlock_t **ptlp)
>> +{
>> + pmd_t pmdval;
>> + pte_t *pte;
>> +
>> + BUG_ON(!pmdvalp);
>
> As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early
> testing and should catch these kind of things.
OK, this patch was sent before you pointed out this, will use
VM_WARN_ON_ONCE() instead of BUG_ON() in v3.
>
> If someone thinks not requiring a non-NULL pointer is better, please
> speak up, I'm not married to that idea :)
>
>> + pte = __pte_offset_map(pmd, addr, &pmdval);
>> + if (likely(pte))
>> + *ptlp = pte_lockptr(mm, &pmdval);
>> + *pmdvalp = pmdval;
>> + return pte;
>> +}
>> +
>> /*
>> * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
>> implementation
>> * __pte_offset_map_lock() below, is usually called with the pmd
>> pointer for
>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
>> *mm, pmd_t *pmd,
>> * recheck *pmd once the lock is taken; in practice, no callsite
>> needs that -
>> * either the mmap_lock for write, or pte_same() check on contents,
>> is enough.
>> *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> + * pte_offset_map(); but when successful, it also outputs a pointer
>> to the
>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>> without
>> + * locking it. This helps the caller to avoid a later
>> pte_lockptr(mm, *pmd),
>> + * which might by that time act on a changed *pmd:
>> pte_offset_map_ro_nolock()
>> + * provides the correct spinlock pointer for the page table that it
>> returns.
>> + * For readonly case, the caller does not need to recheck *pmd after
>> the lock is
>> + * taken, because the RCU lock will ensure that the PTE page will not
>> be freed. > + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
>> like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pdmval. For cases where pte or pmd entries may be modified, that
>> is, maywrite
>> + * case, this can help the caller recheck *pmd once the lock is
>> taken. In some
>> + * cases, that is, either the mmap_lock for write, or pte_same()
>> check on
>> + * contents, is also enough to ensure that the pmd entry is stable.
>> + *
>> * Note that free_pgtables(), used after unmapping detached vmas, or
>> when
>> * exiting the whole mm, does not take page table lock before
>> freeing a page
>> * table, and may not use RCU at all: "outsiders" like khugepaged
>> should avoid
>
> In general to me a step into the right direction. Likely the
> documentation could be further clarified in some aspects:
>
> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
> identify if the page table was replaced in the meantime. Even after
> grabbing the PTL, we might be looking either at a page table that is
> still mapped or one that was unmapped and is about to get freed. But for
> R/O access this is usually sufficient AFAIUK.
>
> Or that "RO" / "RW" expresses the intended semantics, not that the
> *kmap* will be RO/RW protected.
How about the following:
pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it. This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was
unmapped and is about to get freed. But for R/O access this is usually
sufficient AFAIUK.
pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pmdval. For R/W access, the callers cannot accept that the page table
it sees has been unmapped and is about to get freed. The pmdval can help
callers to recheck pmd_same() to identify this case once the spinlock is
taken. For some cases where exclusivity is already guaranteed, such as
holding the write lock of mmap_lock, or in cases where checking is
sufficient, such as a !pte_none() pte will be rechecked after the
spinlock is taken, there is no need to recheck pmdval.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be RO/RW protected.
Thanks,
Qi
>
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-27 4:33 ` Qi Zheng
@ 2024-08-28 10:48 ` David Hildenbrand
2024-08-29 3:27 ` Qi Zheng
2024-08-29 10:59 ` Qi Zheng
0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2024-08-28 10:48 UTC (permalink / raw)
To: Qi Zheng
Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 27.08.24 06:33, Qi Zheng wrote:
> Hi David,
>
> On 2024/8/26 23:21, David Hildenbrand wrote:
>> On 22.08.24 09:13, Qi Zheng wrote:
>>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>>> following two cases:
>>>
>>> 1) After acquiring PTL, only read-only operations are performed on the
>>> PTE
>>> page. In this case, the RCU lock in pte_offset_map_nolock() will
>>> ensure
>>> that the PTE page will not be freed, and there is no need to worry
>>> about whether the pmd entry is modified.
>>
>> There is also the usage where we don't grab the PTL at all, and only do
>> a racy (read-only) lookup.
>
> IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
> in this case.
Yes, but the filemap.c thingy conditionally wants to lock later. But I
agree that pte_offset_map() is better when not even wanting to lock.
[...]
>>> accessor functions:
>>> - pte_offset_map_nolock()
>>> maps PTE, returns pointer to PTE with pointer to its PTE table
>>> lock (not taken), or returns NULL if no PTE table;
>>
>> What will happen to pte_offset_map_nolock() after this series? Does it
>> still exist or will it become an internal helper?
>
> I choose to remove it completely in [PATCH v2 13/14].
>
Ah, great.
[...]
>> If someone thinks not requiring a non-NULL pointer is better, please
>> speak up, I'm not married to that idea :)
>>
>>> + pte = __pte_offset_map(pmd, addr, &pmdval);
>>> + if (likely(pte))
>>> + *ptlp = pte_lockptr(mm, &pmdval);
>>> + *pmdvalp = pmdval;
>>> + return pte;
>>> +}
>>> +
>>> /*
>>> * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
>>> implementation
>>> * __pte_offset_map_lock() below, is usually called with the pmd
>>> pointer for
>>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
>>> *mm, pmd_t *pmd,
>>> * recheck *pmd once the lock is taken; in practice, no callsite
>>> needs that -
>>> * either the mmap_lock for write, or pte_same() check on contents,
>>> is enough.
>>> *
>>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>>> + * pte_offset_map(); but when successful, it also outputs a pointer
>>> to the
>>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>>> without
>>> + * locking it. This helps the caller to avoid a later
>>> pte_lockptr(mm, *pmd),
>>> + * which might by that time act on a changed *pmd:
>>> pte_offset_map_ro_nolock()
>>> + * provides the correct spinlock pointer for the page table that it
>>> returns.
>>> + * For readonly case, the caller does not need to recheck *pmd after
>>> the lock is
>>> + * taken, because the RCU lock will ensure that the PTE page will not
>>> be freed. > + *
>>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
>>> like
>>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>> + * pdmval. For cases where pte or pmd entries may be modified, that
>>> is, maywrite
>>> + * case, this can help the caller recheck *pmd once the lock is
>>> taken. In some
>>> + * cases, that is, either the mmap_lock for write, or pte_same()
>>> check on
>>> + * contents, is also enough to ensure that the pmd entry is stable.
>>> + *
>>> * Note that free_pgtables(), used after unmapping detached vmas, or
>>> when
>>> * exiting the whole mm, does not take page table lock before
>>> freeing a page
>>> * table, and may not use RCU at all: "outsiders" like khugepaged
>>> should avoid
>>
>> In general to me a step into the right direction. Likely the
>> documentation could be further clarified in some aspects:
>>
>> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
>> identify if the page table was replaced in the meantime. Even after
>> grabbing the PTL, we might be looking either at a page table that is
>> still mapped or one that was unmapped and is about to get freed. But for
>> R/O access this is usually sufficient AFAIUK.
>>
>> Or that "RO" / "RW" expresses the intended semantics, not that the
>> *kmap* will be RO/RW protected.
>
> How about the following:
>
> pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> pte_offset_map(); but when successful, it also outputs a pointer to the
> spinlock in ptlp - as pte_offset_map_lock() does, but in this case
> without locking it. This helps the caller to avoid a later
> pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
> pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
> page table that it returns. Even after grabbing the spinlock, we might
> be looking either at a page table that is still mapped or one that was
> unmapped and is about to get freed. But for R/O access this is usually
> sufficient AFAIUK.
Drop the "AFAIUK" :)
"For R/O access this is sufficient."
>
> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> pte_offset_map_ro_nolock(); but when successful, it also outputs the
> pdmval. For R/W access, the callers can not accept that the page table
> it sees has been unmapped and is about to get freed. The pmdval can help
> callers to recheck pmd_same() to identify this case once the spinlock is
> taken. For some cases where exclusivity is already guaranteed, such as
> holding the write lock of mmap_lock, or in cases where checking is
> sufficient, such as a !pte_none() pte will be rechecked after the
> spinlock is taken, there is no need to recheck pdmval.
Right, using pte_same() one can achieve a similar result, assuming that
the freed page table gets all ptes set to pte_none().
page_table_check_pte_clear_range() before pte_free_defer() in
retract_page_tables/collapse_pte_mapped_thp() sanity checks that, I think.
In collapse_huge_page() that is not the case. But here, we also
currently grab all the heavy locks, to prevent any concurrent page table
walker.
>
> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
> will be RO/RW protected.
Good. Please also incorporate the feedback from Muchun.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-28 10:48 ` David Hildenbrand
@ 2024-08-29 3:27 ` Qi Zheng
2024-08-29 10:59 ` Qi Zheng
1 sibling, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-29 3:27 UTC (permalink / raw)
To: David Hildenbrand, muchun.song
Cc: hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/28 18:48, David Hildenbrand wrote:
> On 27.08.24 06:33, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/26 23:21, David Hildenbrand wrote:
>>> On 22.08.24 09:13, Qi Zheng wrote:
>>>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>>>> following two cases:
>>>>
>>>> 1) After acquiring PTL, only read-only operations are performed on the
>>>> PTE
>>>> page. In this case, the RCU lock in pte_offset_map_nolock() will
>>>> ensure
>>>> that the PTE page will not be freed, and there is no need to worry
>>>> about whether the pmd entry is modified.
>>>
>>> There is also the usage where we don't grab the PTL at all, and only do
>>> a racy (read-only) lookup.
>>
>> IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
>> in this case.
>
> Yes, but the filemap.c thingy conditionally wants to lock later. But I
> agree that pte_offset_map() is better when not even wanting to lock.
>
> [...]
>
>>>> accessor functions:
>>>> - pte_offset_map_nolock()
>>>> maps PTE, returns pointer to PTE with pointer to its PTE table
>>>> lock (not taken), or returns NULL if no PTE table;
>>>
>>> What will happen to pte_offset_map_nolock() after this series? Does it
>>> still exist or will it become an internal helper?
>>
>> I choose to remove it completely in [PATCH v2 13/14].
>>
>
> Ah, great.
>
> [...]
>
>>> If someone thinks not requiring a non-NULL pointer is better, please
>>> speak up, I'm not married to that idea :)
>>>
>>>> + pte = __pte_offset_map(pmd, addr, &pmdval);
>>>> + if (likely(pte))
>>>> + *ptlp = pte_lockptr(mm, &pmdval);
>>>> + *pmdvalp = pmdval;
>>>> + return pte;
>>>> +}
>>>> +
>>>> /*
>>>> * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
>>>> implementation
>>>> * __pte_offset_map_lock() below, is usually called with the pmd
>>>> pointer for
>>>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
>>>> *mm, pmd_t *pmd,
>>>> * recheck *pmd once the lock is taken; in practice, no callsite
>>>> needs that -
>>>> * either the mmap_lock for write, or pte_same() check on contents,
>>>> is enough.
>>>> *
>>>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>>>> + * pte_offset_map(); but when successful, it also outputs a pointer
>>>> to the
>>>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>>>> without
>>>> + * locking it. This helps the caller to avoid a later
>>>> pte_lockptr(mm, *pmd),
>>>> + * which might by that time act on a changed *pmd:
>>>> pte_offset_map_ro_nolock()
>>>> + * provides the correct spinlock pointer for the page table that it
>>>> returns.
>>>> + * For readonly case, the caller does not need to recheck *pmd after
>>>> the lock is
>>>> + * taken, because the RCU lock will ensure that the PTE page will not
>>>> be freed. > + *
>>>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
>>>> like
>>>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs
>>>> the
>>>> + * pdmval. For cases where pte or pmd entries may be modified, that
>>>> is, maywrite
>>>> + * case, this can help the caller recheck *pmd once the lock is
>>>> taken. In some
>>>> + * cases, that is, either the mmap_lock for write, or pte_same()
>>>> check on
>>>> + * contents, is also enough to ensure that the pmd entry is stable.
>>>> + *
>>>> * Note that free_pgtables(), used after unmapping detached vmas, or
>>>> when
>>>> * exiting the whole mm, does not take page table lock before
>>>> freeing a page
>>>> * table, and may not use RCU at all: "outsiders" like khugepaged
>>>> should avoid
>>>
>>> In general to me a step into the right direction. Likely the
>>> documentation could be further clarified in some aspects:
>>>
>>> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
>>> identify if the page table was replaced in the meantime. Even after
>>> grabbing the PTL, we might be looking either at a page table that is
>>> still mapped or one that was unmapped and is about to get freed. But for
>>> R/O access this is usually sufficient AFAIUK.
>>>
>>> Or that "RO" / "RW" expresses the intended semantics, not that the
>>> *kmap* will be RO/RW protected.
>>
>> How about the following:
>>
>> pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> pte_offset_map(); but when successful, it also outputs a pointer to the
>> spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>> without locking it. This helps the caller to avoid a later
>> pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
>> pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
>> page table that it returns. Even after grabbing the spinlock, we might
>> be looking either at a page table that is still mapped or one that was
>> unmapped and is about to get freed. But for R/O access this is usually
>> sufficient AFAIUK.
>
> Drop the "AFAIUK" :)
>
> "For R/O access this is sufficient."
OK.
>
>>
>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> pdmval. For R/W access, the callers can not accept that the page table
>> it sees has been unmapped and is about to get freed. The pmdval can help
>> callers to recheck pmd_same() to identify this case once the spinlock is
>> taken. For some cases where exclusivity is already guaranteed, such as
>> holding the write lock of mmap_lock, or in cases where checking is
>> sufficient, such as a !pte_none() pte will be rechecked after the
>> spinlock is taken, there is no need to recheck pdmval.
>
> Right, using pte_same() one can achieve a similar result, assuming that
> the freed page table gets all ptes set to pte_none().
>
> page_table_check_pte_clear_range() before pte_free_defer() in
> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
>
> In collapse_huge_page() that is not the case. But here, we also
> currently grab all heavily locks, to prevent any concurrent page table
> walker.
Yes.
>
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be RO/RW protected.
>
>
> Good. Please also incorporate the feedback from Muchun.
OK, I will change it in v3 to the following:
pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it. This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was
unmapped and is about to get freed. But for R/O access this is
sufficient. So it is only applicable for read-only cases where any
modification operations to the page table are not allowed even if
the corresponding spinlock is held afterwards.
pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pmdval. It is applicable for may-write cases where any modification
operations to the page table may happen after the corresponding spinlock
is held afterwards. But the users should make sure the page table is
stable like holding mmap_lock for write or checking pte_same() or
checking pmd_same() by using the output pmdval before performing the
write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
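(Purely as an illustration of the "checking pte_same()" alternative above, with a hypothetical caller: if the caller saw a pte earlier and only proceeds when that same pte is still in place after taking the PTL, a page table that was freed and had its entries cleared would fail the pte_same() check, so no pmd_same() recheck of the output pmdval is needed. This assumes, as noted earlier in the thread, that a freed page table gets all of its ptes set to pte_none() and that orig_pte itself is not pte_none().)

static int example_update_pte(struct mm_struct *mm, pmd_t *pmd,
			      unsigned long addr, pte_t orig_pte, pte_t newpte)
{
	spinlock_t *ptl;
	pmd_t pmdval;	/* required dummy, not rechecked in this variant */
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return -EAGAIN;
	spin_lock(ptl);
	if (unlikely(!pte_same(ptep_get(pte), orig_pte))) {
		spin_unlock(ptl);
		pte_unmap(pte);
		return -EAGAIN;
	}
	set_pte_at(mm, addr, pte, newpte);
	spin_unlock(ptl);
	pte_unmap(pte);
	return 0;
}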
>
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-28 10:48 ` David Hildenbrand
2024-08-29 3:27 ` Qi Zheng
@ 2024-08-29 10:59 ` Qi Zheng
2024-08-29 15:31 ` David Hildenbrand
1 sibling, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-08-29 10:59 UTC (permalink / raw)
To: David Hildenbrand, muchun.song
Cc: hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/28 18:48, David Hildenbrand wrote:
> On 27.08.24 06:33, Qi Zheng wrote:
[...]
>> sufficient AFAIUK.
>
> Drop the "AFAIUK" :)
>
> "For R/O access this is sufficient."
>
>>
>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> pdmval. For R/W access, the callers can not accept that the page table
>> it sees has been unmapped and is about to get freed. The pmdval can help
>> callers to recheck pmd_same() to identify this case once the spinlock is
>> taken. For some cases where exclusivity is already guaranteed, such as
>> holding the write lock of mmap_lock, or in cases where checking is
>> sufficient, such as a !pte_none() pte will be rechecked after the
>> spinlock is taken, there is no need to recheck pdmval.
>
> Right, using pte_same() one can achieve a similar result, assuming that
> the freed page table gets all ptes set to pte_none().
>
> page_table_check_pte_clear_range() before pte_free_defer() in
> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
Since commit 1d65b771bc08, retract_page_tables() only holds the
i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
holding the write lock of mmap_lock cannot guarantee the stability
of the PTE page.
IIUC, I will also perform a pmd_same() check on the case where the
write lock of mmap_lock is held in v3. Or am I missing something?
>
> In collapse_huge_page() that is not the case. But here, we also
> currently grab all heavily locks, to prevent any concurrent page table
> walker.
>
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be RO/RW protected.
>
>
> Good. Please also incorporate the feedback from Muchun.
>
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-29 10:59 ` Qi Zheng
@ 2024-08-29 15:31 ` David Hildenbrand
2024-08-30 6:37 ` Qi Zheng
0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2024-08-29 15:31 UTC (permalink / raw)
To: Qi Zheng, muchun.song
Cc: hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 29.08.24 12:59, Qi Zheng wrote:
>
>
> On 2024/8/28 18:48, David Hildenbrand wrote:
>> On 27.08.24 06:33, Qi Zheng wrote:
>
> [...]
>
>>> sufficient AFAIUK.
>>
>> Drop the "AFAIUK" :)
>>
>> "For R/O access this is sufficient."
>>
>>>
>>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>> pdmval. For R/W access, the callers can not accept that the page table
>>> it sees has been unmapped and is about to get freed. The pmdval can help
>>> callers to recheck pmd_same() to identify this case once the spinlock is
>>> taken. For some cases where exclusivity is already guaranteed, such as
>>> holding the write lock of mmap_lock, or in cases where checking is
>>> sufficient, such as a !pte_none() pte will be rechecked after the
>>> spinlock is taken, there is no need to recheck pdmval.
>>
>> Right, using pte_same() one can achieve a similar result, assuming that
>> the freed page table gets all ptes set to pte_none().
>>
>> page_table_check_pte_clear_range() before pte_free_defer() in
>> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
>
> Since commit 1d65b771bc08, retract_page_tables() only holds the
> i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
> holding the write lock of mmap_lock cannot guarantee the stability
> of the PTE page.
Guess it depends. khugepaged on anonymous memory will block any page
table walkers (like anon THP collapse does) -- per-VMA lock, mmap lock,
mapping lock/RMAP lock ... so it *should* be sufficient to hold any of
these, right?
So at least for now, these (anonymous memory) cases would be ok. Likely
that will change when reclaiming empty page tables.
>
> IIUC, I will also perform a pmd_same() check on the case where the
> write lock of mmap_lock is held in v3. Or do I miss something?
Can you spell out the instances where you think it might be required?
--
Cheers,
David / dhildenb
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-29 15:31 ` David Hildenbrand
@ 2024-08-30 6:37 ` Qi Zheng
0 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-30 6:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: muchun.song, hughd, willy, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/29 23:31, David Hildenbrand wrote:
> On 29.08.24 12:59, Qi Zheng wrote:
>>
>>
>> On 2024/8/28 18:48, David Hildenbrand wrote:
>>> On 27.08.24 06:33, Qi Zheng wrote:
>>
>> [...]
>>
>>>> sufficient AFAIUK.
>>>
>>> Drop the "AFAIUK" :)
>>>
>>> "For R/O access this is sufficient."
>>>
>>>>
>>>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>>>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>>> pdmval. For R/W access, the callers can not accept that the page table
>>>> it sees has been unmapped and is about to get freed. The pmdval can
>>>> help
>>>> callers to recheck pmd_same() to identify this case once the
>>>> spinlock is
>>>> taken. For some cases where exclusivity is already guaranteed, such as
>>>> holding the write lock of mmap_lock, or in cases where checking is
>>>> sufficient, such as a !pte_none() pte will be rechecked after the
>>>> spinlock is taken, there is no need to recheck pdmval.
>>>
>>> Right, using pte_same() one can achieve a similar result, assuming that
>>> the freed page table gets all ptes set to pte_none().
>>>
>>> page_table_check_pte_clear_range() before pte_free_defer() in
>>> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I
>>> think.
>>
>> Since commit 1d65b771bc08, retract_page_tables() only holds the
>> i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
>> holding the write lock of mmap_lock cannot guarantee the stability
>> of the PTE page.
>
> Guess it depends. khugepaged on anonymous memory will block any page
> table walkers (like anon THP collapse does) -- per-VMA lock, mmap lock,
> mapping lock/RMAP lock ... so it *should* be sufficient to hold any of
> these, right?
retract_page_tables() itself is safe, but because it does not hold the
read lock of mmap_lock, other paths that only hold the write lock of
mmap_lock may be concurrent with it, such as the paths in
[PATCH v2 08/14] and [PATCH v2 09/14].
>
> So at least for now, these (anonymous memory) cases would be ok. Likely
> that will change when reclaiming empty page tables.
When reclaiming the empty page tables, I will hold the read lock of
mmap_lock.
Therefore, either perform a pmd_same() check on the case where the
write lock of mmap_lock is held, or add the read lock of mmap_lock
to retract_page_tables() as well.
>
>>
>> IIUC, I will also perform a pmd_same() check on the case where the
>> write lock of mmap_lock is held in v3. Or do I miss something?
>
> Can you spell out the instances where you think it might be required.
For example, the paths in [PATCH v2 08/14] and [PATCH v2 09/14] need
to do pmd_same() check after holding the PTL.
>
* Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
2024-08-22 7:13 ` [PATCH v2 01/14] mm: pgtable: " Qi Zheng
2024-08-26 3:45 ` [PATCH v2 01/14 update] " Qi Zheng
2024-08-26 15:21 ` [PATCH v2 01/14] " David Hildenbrand
@ 2024-08-28 9:48 ` Muchun Song
2 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-28 9:48 UTC (permalink / raw)
To: Qi Zheng
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
ryan.roberts, christophe.leroy2
On 2024/8/22 15:13, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
>
> 1) After acquiring PTL, only read-only operations are performed on the PTE
> page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
> that the PTE page will not be freed, and there is no need to worry
> about whether the pmd entry is modified.
>
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
> time, we need to ensure that the pmd entry has not been modified
> concurrently.
>
> To more clearly distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. This can help the caller recheck *pmd once the PTL
> is taken. In some cases, that is, either the mmap_lock for write, or
> pte_same() check on contents, is also enough to ensure that the pmd entry
> is stable. But in order to prevent the interface from being abused, we
> choose to pass in a dummy local variable instead of NULL.
>
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Documentation/mm/split_page_table_lock.rst | 7 ++++
> include/linux/mm.h | 5 +++
> mm/pgtable-generic.c | 43 ++++++++++++++++++++++
> 3 files changed, 55 insertions(+)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
> - pte_offset_map_nolock()
> maps PTE, returns pointer to PTE with pointer to its PTE table
> lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_ro_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken) and the value of its pmd entry, or returns NULL
> + if no PTE table;
> - pte_offset_map()
> maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
> - pte_unmap()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index da29b066495d6..a00cb35ce065f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>
> pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, pmd_t *pmdvalp,
> + spinlock_t **ptlp);
>
> #define pte_unmap_unlock(pte, ptl) do { \
> spin_unlock(ptl); \
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index a78a4adf711ac..9a1666574c959 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> return pte;
> }
>
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + return pte;
> +}
> +
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, pmd_t *pmdvalp,
> + spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + BUG_ON(!pmdvalp);
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + *pmdvalp = pmdval;
> + return pte;
> +}
> +
> /*
> * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
> * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> * recheck *pmd once the lock is taken; in practice, no callsite needs that -
> * either the mmap_lock for write, or pte_same() check on contents, is enough.
> *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> + * pte_offset_map(); but when successful, it also outputs a pointer to the
> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
> + * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
> + * provides the correct spinlock pointer for the page table that it returns.
> + * For readonly case, the caller does not need to recheck *pmd after the lock is
> + * taken, because the RCU lock will ensure that the PTE page will not be freed.
I'd like to add something like:
"
It is only applicable for read-only cases where any modification
operations to the PTE page table are not allowed even if the
corresponding PTL is held afterwards.
"
to explicitly specify its use cases.
> + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
> + * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
> + * case, this can help the caller recheck *pmd once the lock is taken. In some
> + * cases, that is, either the mmap_lock for write, or pte_same() check on
> + * contents, is also enough to ensure that the pmd entry is stable.
> + *
I'd like to rewrite some sentences to:
"
It is applicable for may-write cases where any modification operations
to the PTE page table may happen after the corresponding PTL is held
afterwards. But the users should make sure the PTE page table is stable
like holding mmap_lock for write or checking pte_same() or checking
pmd_same() before performing the write operations.
"
Muchun,
Thanks.
> * Note that free_pgtables(), used after unmapping detached vmas, or when
> * exiting the whole mm, does not take page table lock before freeing a page
> * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
* [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
2024-08-22 7:13 ` [PATCH v2 01/14] mm: pgtable: " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 15:26 ` David Hildenbrand
2024-08-22 7:13 ` [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
` (11 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In do_adjust_pte(), we may modify the pte entry. At this time, the write
lock of mmap_lock is not held, and no pte_same() check is performed after
the PTL is held, so the corresponding pmd entry may have been modified
concurrently. Therefore, in order to ensure the stability of the pmd entry,
use pte_offset_map_rw_nolock() to replace pte_offset_map_nolock(), and do a
pmd_same() check after holding the PTL.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/arm/mm/fault-armv.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..de6c7d8a2ddfc 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ pmd_t pmdval;
int ret;
pgd = pgd_offset(vma->vm_mm, address);
@@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
if (pmd_none_or_clear_bad(pmd))
return 0;
+again:
/*
* This is called while another page table is mapped, so we
* must use the nested version. This also means we need to
* open-code the spin-locking.
*/
- pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+ pte = pte_offset_map_rw_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
if (!pte)
return 0;
do_pte_lock(ptl);
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ do_pte_unlock(ptl);
+ pte_unmap(pte);
+ goto again;
+ }
ret = do_adjust_pte(vma, address, pfn, pte);
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-08-26 15:26 ` David Hildenbrand
2024-08-29 3:39 ` Muchun Song
0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:26 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In do_adjust_pte(), we may modify the pte entry. At this time, the write
> lock of mmap_lock is not held, and the pte_same() check is not performed
> after the PTL held. The corresponding pmd entry may have been modified
> concurrently. Therefore, in order to ensure the stability if pmd entry,
> use pte_offset_map_rw_nolock() to replace pte_offset_map_nolock(), and do
> pmd_same() check after holding the PTL.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> arch/arm/mm/fault-armv.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
> index 831793cd6ff94..de6c7d8a2ddfc 100644
> --- a/arch/arm/mm/fault-armv.c
> +++ b/arch/arm/mm/fault-armv.c
> @@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
> + pmd_t pmdval;
> int ret;
>
> pgd = pgd_offset(vma->vm_mm, address);
> @@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
> if (pmd_none_or_clear_bad(pmd))
> return 0;
>
> +again:
> /*
> * This is called while another page table is mapped, so we
> * must use the nested version. This also means we need to
> * open-code the spin-locking.
> */
> - pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
> + pte = pte_offset_map_rw_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
> if (!pte)
> return 0;
>
> do_pte_lock(ptl);
> + if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> + do_pte_unlock(ptl);
> + pte_unmap(pte);
> + goto again;
> + }
>
> ret = do_adjust_pte(vma, address, pfn, pte);
>
Looks correct to me, but I wonder why the missing pmd_same check is not
an issue so far ... any experts? THP on __LINUX_ARM_ARCH__ < 6 is not
really used/possible?
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock()
2024-08-26 15:26 ` David Hildenbrand
@ 2024-08-29 3:39 ` Muchun Song
0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 3:39 UTC (permalink / raw)
To: David Hildenbrand, Qi Zheng
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, hughd,
willy, vbabka, akpm, rppt, vishal.moola, peterx, ryan.roberts,
christophe.leroy2
On 2024/8/26 23:26, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> In do_adjust_pte(), we may modify the pte entry. At this time, the write
>> lock of mmap_lock is not held, and the pte_same() check is not performed
>> after the PTL held. The corresponding pmd entry may have been modified
>> concurrently. Therefore, in order to ensure the stability if pmd entry,
>> use pte_offset_map_rw_nolock() to replace pte_offset_map_nolock(),
>> and do
>> pmd_same() check after holding the PTL.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
>> ---
>> arch/arm/mm/fault-armv.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
>> index 831793cd6ff94..de6c7d8a2ddfc 100644
>> --- a/arch/arm/mm/fault-armv.c
>> +++ b/arch/arm/mm/fault-armv.c
>> @@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma,
>> unsigned long address,
>> pud_t *pud;
>> pmd_t *pmd;
>> pte_t *pte;
>> + pmd_t pmdval;
>> int ret;
>> pgd = pgd_offset(vma->vm_mm, address);
>> @@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct
>> *vma, unsigned long address,
>> if (pmd_none_or_clear_bad(pmd))
>> return 0;
>> +again:
>> /*
>> * This is called while another page table is mapped, so we
>> * must use the nested version. This also means we need to
>> * open-code the spin-locking.
>> */
>> - pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
>> + pte = pte_offset_map_rw_nolock(vma->vm_mm, pmd, address,
>> &pmdval, &ptl);
>> if (!pte)
>> return 0;
>> do_pte_lock(ptl);
>> + if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
>> + do_pte_unlock(ptl);
>> + pte_unmap(pte);
>> + goto again;
>> + }
>> ret = do_adjust_pte(vma, address, pfn, pte);
>
> Looks correct to me, but I wonder why the missing pmd_same check is
> not an issue so far ... any experts? THP on __LINUX_ARM_ARCH__ < 6 is
> not really used/possible?
I think it is because it does not support THP.
TRANSPARENT_HUGEPAGE depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE which
depends on ARM_LPAE. However, the Kconfig says ARM_LPAE is only
supported on ARMv7 processors.
config ARM_LPAE
bool "Support for the Large Physical Address Extension"
depends on MMU && CPU_32v7 && !CPU_32v6 && !CPU_32v5 && \
!CPU_32v4 && !CPU_32v3
select PHYS_ADDR_T_64BIT
select SWIOTLB
help
Say Y if you have an ARMv7 processor supporting the LPAE page
table format and you would like to access memory beyond the
4GB limit. The resulting kernel image will not run on
processors without the LPA extension.
If unsure, say N.
Thanks.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
2024-08-22 7:13 ` [PATCH v2 01/14] mm: pgtable: " Qi Zheng
2024-08-22 7:13 ` [PATCH v2 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 15:28 ` David Hildenbrand
2024-08-29 7:21 ` Muchun Song
2024-08-22 7:13 ` [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
` (10 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In assert_pte_locked(), we just get the ptl and assert that it is already
held, so convert it to using pte_offset_map_ro_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/powerpc/mm/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..61df5aed79894 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
*/
if (pmd_none(*pmd))
return;
- pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+ pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
BUG_ON(!pte);
assert_spin_locked(ptl);
pte_unmap(pte);
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
@ 2024-08-26 15:28 ` David Hildenbrand
2024-08-29 7:21 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:28 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In assert_pte_locked(), we just get the ptl and assert if it was already
> held, so convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> arch/powerpc/mm/pgtable.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 7316396e452d8..61df5aed79894 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> */
> if (pmd_none(*pmd))
> return;
> - pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> + pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
> BUG_ON(!pte);
> assert_spin_locked(ptl);
> pte_unmap(pte);
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
2024-08-26 15:28 ` David Hildenbrand
@ 2024-08-29 7:21 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 7:21 UTC (permalink / raw)
To: Qi Zheng, david, hughd, willy, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 2024/8/22 15:13, Qi Zheng wrote:
> In assert_pte_locked(), we just get the ptl and assert if it was already
> held, so convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (2 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 15:29 ` David Hildenbrand
2024-08-29 7:23 ` Muchun Song
2024-08-22 7:13 ` [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
` (9 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In filemap_fault_recheck_pte_none(), we just do pte_none() check, so
convert it to using pte_offset_map_ro_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/filemap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 0f13126b43b08..c98da9af6b9bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3234,8 +3234,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
return 0;
- ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+ &vmf->ptl);
if (unlikely(!ptep))
return VM_FAULT_NOPAGE;
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-08-26 15:29 ` David Hildenbrand
2024-08-29 7:23 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:29 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In filemap_fault_recheck_pte_none(), we just do pte_none() check, so
> convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/filemap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0f13126b43b08..c98da9af6b9bd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3234,8 +3234,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
> if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
> return 0;
>
> - ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address,
> + &vmf->ptl);
> if (unlikely(!ptep))
> return VM_FAULT_NOPAGE;
>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
2024-08-26 15:29 ` David Hildenbrand
@ 2024-08-29 7:23 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 7:23 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
Linux Memory Management List, linux-arm-kernel, linuxppc-dev
> On Aug 22, 2024, at 15:13, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
> In filemap_fault_recheck_pte_none(), we just do pte_none() check, so
> convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (3 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 15:33 ` David Hildenbrand
2024-08-29 7:25 ` Muchun Song
2024-08-22 7:13 ` [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
` (8 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
in do_swap_page(). In other places, we directly use pte_offset_map_lock(),
so convert it to using pte_offset_map_ro_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4a83c40d90538..53bfa7f4b7f82 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,7 +1011,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};
if (!pte++) {
- pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+ /*
+ * Here the ptl is only used to check pte_same() in
+ * do_swap_page(), so the read-only version is enough.
+ */
+ pte = pte_offset_map_ro_nolock(mm, pmd, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-08-26 15:33 ` David Hildenbrand
2024-08-29 7:25 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:33 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
> in do_swap_page(). In other places, we directly use pte_offset_map_lock(),
> so convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/khugepaged.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4a83c40d90538..53bfa7f4b7f82 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1011,7 +1011,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> };
>
> if (!pte++) {
> - pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
> + /*
> + * Here the ptl is only used to check pte_same() in
> + * do_swap_page(), so readonly version is enough.
> + */
> + pte = pte_offset_map_ro_nolock(mm, pmd, address, &ptl);
> if (!pte) {
> mmap_read_unlock(mm);
> result = SCAN_PMD_NULL;
Suboptimal that the pteval comparison + unmap is buried in
do_swap_page(). Moving that to the caller is also not significantly
better ...
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
2024-08-22 7:13 ` [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
2024-08-26 15:33 ` David Hildenbrand
@ 2024-08-29 7:25 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 7:25 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
vishal.moola, peterx, ryan.roberts, christophe.leroy2,
linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
> On Aug 22, 2024, at 15:13, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
> In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
> in do_swap_page(). In other places, we directly use pte_offset_map_lock(),
> so convert it to using pte_offset_map_ro_nolock().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (4 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-26 15:36 ` David Hildenbrand
2024-08-29 7:30 ` Muchun Song
2024-08-22 7:13 ` [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
` (7 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, there is no need to get pmdval to do a
pmd_same() check; just pass a dummy variable to it.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 93c0c25433d02..7b6071a0e21e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5499,14 +5499,22 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
vmf->pte = NULL;
vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
} else {
+ pmd_t dummy_pmdval;
+
/*
* A regular pmd is established and it can't morph into a huge
* pmd by anon khugepaged, since that takes mmap_lock in write
* mode; but shmem or file collapse to THP could still morph
* it into a huge pmd: just retry later if so.
+ *
+ * Use the maywrite version to indicate that vmf->pte will be
+ * modified, but since we will use pte_same() to detect the
+ * change of the pte entry, there is no need to get pmdval, so
+ * just pass a dummy variable to it.
*/
- vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &dummy_pmdval,
+ &vmf->ptl);
if (unlikely(!vmf->pte))
return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-08-26 15:36 ` David Hildenbrand
2024-08-27 4:53 ` Qi Zheng
2024-08-29 7:30 ` Muchun Song
1 sibling, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2024-08-26 15:36 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
> vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
> will do the pte_same() check, so there is no need to get pmdval to do
> pmd_same() check, just pass a dummy variable to it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/memory.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d02..7b6071a0e21e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5499,14 +5499,22 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> vmf->pte = NULL;
> vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
> } else {
> + pmd_t dummy_pmdval;
> +
> /*
> * A regular pmd is established and it can't morph into a huge
> * pmd by anon khugepaged, since that takes mmap_lock in write
> * mode; but shmem or file collapse to THP could still morph
> * it into a huge pmd: just retry later if so.
> + *
> + * Use the maywrite version to indicate that vmf->pte will be
> + * modified, but since we will use pte_same() to detect the
> + * change of the pte entry, there is no need to get pmdval, so
> + * just pass a dummy variable to it.
> */
> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &dummy_pmdval,
> + &vmf->ptl);
> if (unlikely(!vmf->pte))
> return 0;
> vmf->orig_pte = ptep_get_lockless(vmf->pte);
Now I understand why we don't need the PMD val in these cases ... the PTE
would also be pte_none() at the point the page table is freed, so we
would detect the change as well.
I do enjoy documenting why we use a dummy value, though. Likely without
that, new users will just pass NULL and call it a day.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
2024-08-26 15:36 ` David Hildenbrand
@ 2024-08-27 4:53 ` Qi Zheng
0 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-27 4:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/26 23:36, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>> vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
>> will do the pte_same() check, so there is no need to get pmdval to do
>> pmd_same() check, just pass a dummy variable to it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/memory.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 93c0c25433d02..7b6071a0e21e2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5499,14 +5499,22 @@ static vm_fault_t handle_pte_fault(struct
>> vm_fault *vmf)
>> vmf->pte = NULL;
>> vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
>> } else {
>> + pmd_t dummy_pmdval;
>> +
>> /*
>> * A regular pmd is established and it can't morph into a huge
>> * pmd by anon khugepaged, since that takes mmap_lock in write
>> * mode; but shmem or file collapse to THP could still morph
>> * it into a huge pmd: just retry later if so.
>> + *
>> + * Use the maywrite version to indicate that vmf->pte will be
>> + * modified, but since we will use pte_same() to detect the
>> + * change of the pte entry, there is no need to get pmdval, so
>> + * just pass a dummy variable to it.
>> */
>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
>> + vmf->address, &dummy_pmdval,
>> + &vmf->ptl);
>> if (unlikely(!vmf->pte))
>> return 0;
>> vmf->orig_pte = ptep_get_lockless(vmf->pte);
>
> No I understand why we don't need the PMD val in these cases ... the PTE
> would also be pte_none() at the point the page table is freed, so we
> would detect the change as well.
Yes.
>
> I do enjoy documenting why we use a dummy value, though. Likely without
> that, new users will just pass NULL and call it a day.
OK, how about the following:
Use the maywrite version to indicate that vmf->pte will be
modified, but since we will use pte_same() to detect the
change of the !pte_none() entry, there is no need to recheck
the pmdval. Here we choose to pass a dummy variable instead
of NULL, which helps new users think about why this place is
special.
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks!
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
2024-08-26 15:36 ` David Hildenbrand
@ 2024-08-29 7:30 ` Muchun Song
1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 7:30 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
vishal.moola, peterx, ryan.roberts, christophe.leroy2,
linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
> On Aug 22, 2024, at 15:13, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
> vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
> will do the pte_same() check, so there is no need to get pmdval to do
> pmd_same() check, just pass a dummy variable to it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
A nit below.
> ---
> mm/memory.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d02..7b6071a0e21e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5499,14 +5499,22 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> vmf->pte = NULL;
> vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
> } else {
> + pmd_t dummy_pmdval;
> +
> /*
> * A regular pmd is established and it can't morph into a huge
> * pmd by anon khugepaged, since that takes mmap_lock in write
> * mode; but shmem or file collapse to THP could still morph
> * it into a huge pmd: just retry later if so.
> + *
> + * Use the maywrite version to indicate that vmf->pte will be
Not "will be", should be "may be".
> + * modified, but since we will use pte_same() to detect the
> + * change of the pte entry, there is no need to get pmdval, so
> + * just pass a dummy variable to it.
> */
> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &dummy_pmdval,
> + &vmf->ptl);
> if (unlikely(!vmf->pte))
> return 0;
> vmf->orig_pte = ptep_get_lockless(vmf->pte);
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (5 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-29 8:10 ` Muchun Song
2024-08-22 7:13 ` [PATCH v2 08/14] mm: copy_pte_range() " Qi Zheng
` (6 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the write lock of mmap_lock is not held, and the pte_same()
check is not performed after the PTL is held. So we should get pgt_pmd and
do a pmd_same() check after the ptl is held.
For the case where the ptl is released first and then the pml is acquired,
the PTE page may have been freed, so we must do pmd_same() check before
reacquiring the ptl.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 53bfa7f4b7f82..15d3f7f3c65f2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
pml = pmd_lock(mm, pmd);
- start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+ start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
if (!start_pte) /* mmap_lock + page lock should prevent this */
goto abort;
if (!pml)
@@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
else if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+ goto abort;
+
/* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove empty page table */
if (!pml) {
pml = pmd_lock(mm, pmd);
+ /*
+ * We called pte_unmap() and released the ptl before acquiring
+ * the pml, which means we left the RCU critical section, so the
+ * PTE page may have been freed; we must therefore do a pmd_same()
+ * check before reacquiring the ptl.
+ */
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ spin_unlock(pml);
+ goto pmd_change;
+ }
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
}
@@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
pte_unmap_unlock(start_pte, ptl);
if (pml && pml != ptl)
spin_unlock(pml);
+pmd_change:
if (notified)
mmu_notifier_invalidate_range_end(&range);
drop_folio:
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-08-29 8:10 ` Muchun Song
2024-08-30 6:54 ` Qi Zheng
0 siblings, 1 reply; 43+ messages in thread
From: Muchun Song @ 2024-08-29 8:10 UTC (permalink / raw)
To: Qi Zheng, david, hughd, willy, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 2024/8/22 15:13, Qi Zheng wrote:
> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
> acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
> this time, the write lock of mmap_lock is not held, and the pte_same()
> check is not performed after the PTL held. So we should get pgt_pmd and do
> pmd_same() check after the ptl held.
>
> For the case where the ptl is released first and then the pml is acquired,
> the PTE page may have been freed, so we must do pmd_same() check before
> reacquiring the ptl.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/khugepaged.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 53bfa7f4b7f82..15d3f7f3c65f2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
> pml = pmd_lock(mm, pmd);
>
> - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
> + start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
> if (!start_pte) /* mmap_lock + page lock should prevent this */
> goto abort;
> if (!pml)
> @@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> else if (ptl != pml)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
> + goto abort;
> +
> /* step 2: clear page table and adjust rmap */
> for (i = 0, addr = haddr, pte = start_pte;
> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> @@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> /* step 4: remove empty page table */
> if (!pml) {
> pml = pmd_lock(mm, pmd);
> + /*
> + * We called pte_unmap() and release the ptl before acquiring
> + * the pml, which means we left the RCU critical section, so the
> + * PTE page may have been freed, so we must do pmd_same() check
> + * before reacquiring the ptl.
> + */
> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
> + spin_unlock(pml);
> + goto pmd_change;
Seems we forgot to flush the TLB since we've cleared some pte entries?
> + }
> if (ptl != pml)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> }
> @@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> pte_unmap_unlock(start_pte, ptl);
> if (pml && pml != ptl)
> spin_unlock(pml);
> +pmd_change:
> if (notified)
> mmu_notifier_invalidate_range_end(&range);
> drop_folio:
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-08-29 8:10 ` Muchun Song
@ 2024-08-30 6:54 ` Qi Zheng
2024-09-05 6:32 ` Muchun Song
0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-08-30 6:54 UTC (permalink / raw)
To: Muchun Song
Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/29 16:10, Muchun Song wrote:
>
>
> On 2024/8/22 15:13, Qi Zheng wrote:
>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
>> acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
>> this time, the write lock of mmap_lock is not held, and the pte_same()
>> check is not performed after the PTL held. So we should get pgt_pmd
>> and do
>> pmd_same() check after the ptl held.
>>
>> For the case where the ptl is released first and then the pml is
>> acquired,
>> the PTE page may have been freed, so we must do pmd_same() check before
>> reacquiring the ptl.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/khugepaged.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 53bfa7f4b7f82..15d3f7f3c65f2 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>> pml = pmd_lock(mm, pmd);
>> - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>> + start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd,
>> &ptl);
>> if (!start_pte) /* mmap_lock + page lock should prevent
>> this */
>> goto abort;
>> if (!pml)
>> @@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> else if (ptl != pml)
>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>> + goto abort;
>> +
>> /* step 2: clear page table and adjust rmap */
>> for (i = 0, addr = haddr, pte = start_pte;
>> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> @@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> /* step 4: remove empty page table */
>> if (!pml) {
>> pml = pmd_lock(mm, pmd);
>> + /*
>> + * We called pte_unmap() and release the ptl before acquiring
>> + * the pml, which means we left the RCU critical section, so the
>> + * PTE page may have been freed, so we must do pmd_same() check
>> + * before reacquiring the ptl.
>> + */
>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> + spin_unlock(pml);
>> + goto pmd_change;
>
> Seems we forget to flush TLB since we've cleared some pte entry?
See the comment above ptep_clear():
/*
* Must clear entry, or a racing truncate may re-remove it.
* TLB flush can be left until pmdp_collapse_flush() does it.
* PTE dirty? Shmem page is already dirty; file is read-only.
*/
The TLB flush was handed over to pmdp_collapse_flush(). If a
concurrent thread frees the PTE page at this time, the TLB will
also be flushed after pmd_clear().
>
>> + }
>> if (ptl != pml)
>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> }
>> @@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> pte_unmap_unlock(start_pte, ptl);
>> if (pml && pml != ptl)
>> spin_unlock(pml);
>> +pmd_change:
>> if (notified)
>> mmu_notifier_invalidate_range_end(&range);
>> drop_folio:
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-08-30 6:54 ` Qi Zheng
@ 2024-09-05 6:32 ` Muchun Song
2024-09-05 6:41 ` Qi Zheng
0 siblings, 1 reply; 43+ messages in thread
From: Muchun Song @ 2024-09-05 6:32 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
Linux Memory Management List, linux-arm-kernel, linuxppc-dev
> On Aug 30, 2024, at 14:54, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
>
>
> On 2024/8/29 16:10, Muchun Song wrote:
>> On 2024/8/22 15:13, Qi Zheng wrote:
>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
>>> acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
>>> this time, the write lock of mmap_lock is not held, and the pte_same()
>>> check is not performed after the PTL held. So we should get pgt_pmd and do
>>> pmd_same() check after the ptl held.
>>>
>>> For the case where the ptl is released first and then the pml is acquired,
>>> the PTE page may have been freed, so we must do pmd_same() check before
>>> reacquiring the ptl.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> mm/khugepaged.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 53bfa7f4b7f82..15d3f7f3c65f2 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>> if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>> pml = pmd_lock(mm, pmd);
>>> - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>> + start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>> if (!start_pte) /* mmap_lock + page lock should prevent this */
>>> goto abort;
>>> if (!pml)
>>> @@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>> else if (ptl != pml)
>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>> + goto abort;
>>> +
>>> /* step 2: clear page table and adjust rmap */
>>> for (i = 0, addr = haddr, pte = start_pte;
>>> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>> @@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>> /* step 4: remove empty page table */
>>> if (!pml) {
>>> pml = pmd_lock(mm, pmd);
>>> + /*
>>> + * We called pte_unmap() and release the ptl before acquiring
>>> + * the pml, which means we left the RCU critical section, so the
>>> + * PTE page may have been freed, so we must do pmd_same() check
>>> + * before reacquiring the ptl.
>>> + */
>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>> + spin_unlock(pml);
>>> + goto pmd_change;
>> Seems we forget to flush TLB since we've cleared some pte entry?
>
> See comment above the ptep_clear():
>
> /*
> * Must clear entry, or a racing truncate may re-remove it.
> * TLB flush can be left until pmdp_collapse_flush() does it.
> * PTE dirty? Shmem page is already dirty; file is read-only.
> */
>
> The TLB flush was handed over to pmdp_collapse_flush(). If a
But you skipped pmdp_collapse_flush().
> concurrent thread free the PTE page at this time, the TLB will
> also be flushed after pmd_clear().
>
>>> + }
>>> if (ptl != pml)
>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> }
>>> @@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>> pte_unmap_unlock(start_pte, ptl);
>>> if (pml && pml != ptl)
>>> spin_unlock(pml);
>>> +pmd_change:
>>> if (notified)
>>> mmu_notifier_invalidate_range_end(&range);
>>> drop_folio:
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-09-05 6:32 ` Muchun Song
@ 2024-09-05 6:41 ` Qi Zheng
2024-09-05 7:18 ` Muchun Song
0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2024-09-05 6:41 UTC (permalink / raw)
To: Muchun Song
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
Linux Memory Management List, linux-arm-kernel, linuxppc-dev
On 2024/9/5 14:32, Muchun Song wrote:
>
>
>> On Aug 30, 2024, at 14:54, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>
>>
>>
>> On 2024/8/29 16:10, Muchun Song wrote:
>>> On 2024/8/22 15:13, Qi Zheng wrote:
>>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
>>>> acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
>>>> this time, the write lock of mmap_lock is not held, and the pte_same()
>>>> check is not performed after the PTL held. So we should get pgt_pmd and do
>>>> pmd_same() check after the ptl held.
>>>>
>>>> For the case where the ptl is released first and then the pml is acquired,
>>>> the PTE page may have been freed, so we must do pmd_same() check before
>>>> reacquiring the ptl.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/khugepaged.c | 16 +++++++++++++++-
>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 53bfa7f4b7f82..15d3f7f3c65f2 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>> if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>>> pml = pmd_lock(mm, pmd);
>>>> - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>>> + start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>>> if (!start_pte) /* mmap_lock + page lock should prevent this */
>>>> goto abort;
>>>> if (!pml)
>>>> @@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>> else if (ptl != pml)
>>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>> + goto abort;
>>>> +
>>>> /* step 2: clear page table and adjust rmap */
>>>> for (i = 0, addr = haddr, pte = start_pte;
>>>> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>>> @@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>> /* step 4: remove empty page table */
>>>> if (!pml) {
>>>> pml = pmd_lock(mm, pmd);
>>>> + /*
>>>> + * We called pte_unmap() and release the ptl before acquiring
>>>> + * the pml, which means we left the RCU critical section, so the
>>>> + * PTE page may have been freed, so we must do pmd_same() check
>>>> + * before reacquiring the ptl.
>>>> + */
>>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>>> + spin_unlock(pml);
>>>> + goto pmd_change;
>>> Seems we forget to flush TLB since we've cleared some pte entry?
>>
>> See comment above the ptep_clear():
>>
>> /*
>> * Must clear entry, or a racing truncate may re-remove it.
>> * TLB flush can be left until pmdp_collapse_flush() does it.
>> * PTE dirty? Shmem page is already dirty; file is read-only.
>> */
>>
>> The TLB flush was handed over to pmdp_collapse_flush(). If a
>
> But you skipped pmdp_collapse_flush().
I skip it only in the !pmd_same() case, at which time it must have been
cleared by another thread, which will be responsible for flushing the TLB:
CPU 0 (concurrent thread)               CPU 1 (collapse_pte_mapped_thp)

pmd_clear
spin_unlock
flushing tlb
                                        spin_lock
                                        if (!pmd_same)
                                                goto pmd_change;
                                        pmdp_collapse_flush   /* skipped via the goto */
Did I miss something?
>
>> concurrent thread free the PTE page at this time, the TLB will
>> also be flushed after pmd_clear().
>>
>>>> + }
>>>> if (ptl != pml)
>>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>> }
>>>> @@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>> pte_unmap_unlock(start_pte, ptl);
>>>> if (pml && pml != ptl)
>>>> spin_unlock(pml);
>>>> +pmd_change:
>>>> if (notified)
>>>> mmu_notifier_invalidate_range_end(&range);
>>>> drop_folio:
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
2024-09-05 6:41 ` Qi Zheng
@ 2024-09-05 7:18 ` Muchun Song
0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-09-05 7:18 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
Linux Memory Management List, linux-arm-kernel, linuxppc-dev
> On Sep 5, 2024, at 14:41, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
>
>
> On 2024/9/5 14:32, Muchun Song wrote:
>>> On Aug 30, 2024, at 14:54, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>
>>>
>>>
>>> On 2024/8/29 16:10, Muchun Song wrote:
>>>> On 2024/8/22 15:13, Qi Zheng wrote:
>>>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
>>>>> acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
>>>>> this time, the write lock of mmap_lock is not held, and the pte_same()
>>>>> check is not performed after the PTL held. So we should get pgt_pmd and do
>>>>> pmd_same() check after the ptl held.
>>>>>
>>>>> For the case where the ptl is released first and then the pml is acquired,
>>>>> the PTE page may have been freed, so we must do pmd_same() check before
>>>>> reacquiring the ptl.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> mm/khugepaged.c | 16 +++++++++++++++-
>>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 53bfa7f4b7f82..15d3f7f3c65f2 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1604,7 +1604,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>> if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>>>> pml = pmd_lock(mm, pmd);
>>>>> - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>>>> + start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>>>> if (!start_pte) /* mmap_lock + page lock should prevent this */
>>>>> goto abort;
>>>>> if (!pml)
>>>>> @@ -1612,6 +1612,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>> else if (ptl != pml)
>>>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>>> + goto abort;
>>>>> +
>>>>> /* step 2: clear page table and adjust rmap */
>>>>> for (i = 0, addr = haddr, pte = start_pte;
>>>>> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>>>> @@ -1657,6 +1660,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>> /* step 4: remove empty page table */
>>>>> if (!pml) {
>>>>> pml = pmd_lock(mm, pmd);
>>>>> + /*
>>>>> + * We called pte_unmap() and release the ptl before acquiring
>>>>> + * the pml, which means we left the RCU critical section, so the
>>>>> + * PTE page may have been freed, so we must do pmd_same() check
>>>>> + * before reacquiring the ptl.
>>>>> + */
>>>>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>>>> + spin_unlock(pml);
>>>>> + goto pmd_change;
>>>> Seems we forget to flush TLB since we've cleared some pte entry?
>>>
>>> See comment above the ptep_clear():
>>>
>>> /*
>>> * Must clear entry, or a racing truncate may re-remove it.
>>> * TLB flush can be left until pmdp_collapse_flush() does it.
>>> * PTE dirty? Shmem page is already dirty; file is read-only.
>>> */
>>>
>>> The TLB flush was handed over to pmdp_collapse_flush(). If a
>> But you skipped pmdp_collapse_flush().
>
> I skip it only in !pmd_same() case, at which time it must be cleared
> by other thread, which will be responsible for flushing TLB:
WOW! AMAZING! You are right.
>
> CPU 0 CPU 1
> pmd_clear
> spin_unlock
> flushing tlb
> spin_lock
> if (!pmd_same)
> goto pmd_change;
> pmdp_collapse_flush
>
> Did I miss something?
>
>>> concurrent thread free the PTE page at this time, the TLB will
>>> also be flushed after pmd_clear().
>>>
>>>>> + }
>>>>> if (ptl != pml)
>>>>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>>> }
>>>>> @@ -1688,6 +1701,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>> pte_unmap_unlock(start_pte, ptl);
>>>>> if (pml && pml != ptl)
>>>>> spin_unlock(pml);
>>>>> +pmd_change:
>>>>> if (notified)
>>>>> mmu_notifier_invalidate_range_end(&range);
>>>>> drop_folio:
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (6 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-29 8:13 ` Muchun Song
2024-08-29 15:36 ` David Hildenbrand
2024-08-22 7:13 ` [PATCH v2 09/14] mm: mremap: move_ptes() " Qi Zheng
` (5 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
already hold the write lock of mmap_lock, there is no need to get pmdval
to do a pmd_same() check; just pass a dummy variable to it.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7b6071a0e21e2..30d98025b2a40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1083,6 +1083,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
struct mm_struct *src_mm = src_vma->vm_mm;
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
+ pmd_t dummy_pmdval;
pte_t ptent;
spinlock_t *src_ptl, *dst_ptl;
int progress, max_nr, ret = 0;
@@ -1108,7 +1109,15 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+
+ /*
+ * Use the maywrite version to indicate that dst_pte will be modified,
+ * but since we already hold the write lock of mmap_lock, there is no
+ * need to get pmdval to do pmd_same() check, just pass a dummy variable
+ * to it.
+ */
+ src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &dummy_pmdval,
+ &src_ptl);
if (!src_pte) {
pte_unmap_unlock(dst_pte, dst_ptl);
/* ret == 0 */
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v2 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-29 8:13 ` Muchun Song
2024-08-29 15:36 ` David Hildenbrand
1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2024-08-29 8:13 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
Linux Memory Management List, linux-arm-kernel, linuxppc-dev
> On Aug 22, 2024, at 15:13, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
> In copy_pte_range(), we may modify the src_pte entry after holding the
> src_ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
> already hold the write lock of mmap_lock, there is no need to get pmdval
> to do pmd_same() check, just pass a dummy variable to it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
2024-08-22 7:13 ` [PATCH v2 08/14] mm: copy_pte_range() " Qi Zheng
2024-08-29 8:13 ` Muchun Song
@ 2024-08-29 15:36 ` David Hildenbrand
2024-08-30 6:42 ` Qi Zheng
1 sibling, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2024-08-29 15:36 UTC (permalink / raw)
To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
On 22.08.24 09:13, Qi Zheng wrote:
> In copy_pte_range(), we may modify the src_pte entry after holding the
> src_ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
> already hold the write lock of mmap_lock, there is no need to get pmdval
> to do pmd_same() check, just pass a dummy variable to it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/memory.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7b6071a0e21e2..30d98025b2a40 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1083,6 +1083,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> struct mm_struct *src_mm = src_vma->vm_mm;
> pte_t *orig_src_pte, *orig_dst_pte;
> pte_t *src_pte, *dst_pte;
> + pmd_t dummy_pmdval;
> pte_t ptent;
> spinlock_t *src_ptl, *dst_ptl;
> int progress, max_nr, ret = 0;
> @@ -1108,7 +1109,15 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> ret = -ENOMEM;
> goto out;
> }
> - src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> +
> + /*
> + * Use the maywrite version to indicate that dst_pte will be modified,
> + * but since we already hold the write lock of mmap_lock, there is no
> + * need to get pmdval to do pmd_same() check, just pass a dummy variable
> + * to it.
As we hold the mmap lock write lock, I assume it will prevent any page
table removal, because they need *at least* the mmap lock in read mode,
right?
We should probably document the rules for removing a page table -- which
locks must be held in which mode (if not already done).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
2024-08-29 15:36 ` David Hildenbrand
@ 2024-08-30 6:42 ` Qi Zheng
0 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-30 6:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/29 23:36, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> In copy_pte_range(), we may modify the src_pte entry after holding the
>> src_ptl, so convert it to use pte_offset_map_rw_nolock(). But since we
>> already hold the write lock of mmap_lock, there is no need to get the
>> pmdval to do a pmd_same() check; just pass a dummy variable to it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/memory.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7b6071a0e21e2..30d98025b2a40 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1083,6 +1083,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> pte_t *orig_src_pte, *orig_dst_pte;
>> pte_t *src_pte, *dst_pte;
>> + pmd_t dummy_pmdval;
>> pte_t ptent;
>> spinlock_t *src_ptl, *dst_ptl;
>> int progress, max_nr, ret = 0;
>> @@ -1108,7 +1109,15 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> ret = -ENOMEM;
>> goto out;
>> }
>> - src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>> +
>> + /*
>> + * Use the maywrite version to indicate that dst_pte will be modified,
>> + * but since we already hold the write lock of mmap_lock, there is no
>> + * need to get pmdval to do pmd_same() check, just pass a dummy variable
>> + * to it.
>
> As we hold the mmap lock write lock, I assume it will prevent any page
> table removal, because they need *at least* the mmap lock in read mode,
> right?
Except for retract_page_tables(), all the other page table removal paths
hold the read lock of mmap_lock.
>
> We should probably document the rules for removing a page table -- which
> locks must be held in which mode (if not already done).
Agreed, I will document it in v3.
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 09/14] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (7 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-22 7:13 ` [PATCH v2 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
` (4 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to use pte_offset_map_rw_nolock(). But since we already hold the
exclusive mmap_lock, there is no need to get the pmdval to do a pmd_same()
check; just pass a dummy variable to it.
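For reference, a condensed sketch of the locking order this relies on
(illustrative only; the helper name is made up and the error paths are
simplified, the actual change is the hunk below):

static int example_map_both_tables(struct mm_struct *mm,
				   pmd_t *old_pmd, unsigned long old_addr,
				   pmd_t *new_pmd, unsigned long new_addr)
{
	spinlock_t *old_ptl, *new_ptl;
	pte_t *old_pte, *new_pte;
	pmd_t dummy_pmdval;	/* ignored: exclusive mmap_lock keeps *new_pmd stable */

	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	if (!old_pte)
		return -EAGAIN;
	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr,
					   &dummy_pmdval, &new_ptl);
	if (!new_pte) {
		pte_unmap_unlock(old_pte, old_ptl);
		return -EAGAIN;
	}
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	/* ... move PTEs from the old table to the new one ... */

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	pte_unmap(new_pte);
	pte_unmap_unlock(old_pte, old_ptl);
	return 0;
}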
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/mremap.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 24712f8dbb6b5..f96b025c09079 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -143,6 +143,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
+ pmd_t dummy_pmdval;
int err = 0;
/*
@@ -175,7 +176,13 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
err = -EAGAIN;
goto out;
}
- new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ /*
+ * Use the maywrite version to indicate that new_pte will be modified,
+ * but since we hold the exclusive mmap_lock, there is no need to
+ * recheck pmd_same() after acquiring the new_ptl.
+ */
+ new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+ &new_ptl);
if (!new_pte) {
pte_unmap_unlock(old_pte, old_ptl);
err = -EAGAIN;
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (8 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 09/14] mm: mremap: move_ptes() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-22 7:13 ` [PATCH v2 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
` (3 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). Here the
write lock of mmap_lock is not held, and no pte_same() check is performed
after the pvmw->ptl is taken, so we should get the pmdval and do a
pmd_same() check to ensure the stability of pvmw->pmd.
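The resulting pattern (map the PTE table, take the PTL, then verify that
the pmd entry did not change in between, retrying if it did) can be
sketched as a stand-alone helper; the name is hypothetical and only the
hunks below are the actual change:

static pte_t *example_map_pte_locked(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long addr, spinlock_t **ptlp)
{
	pmd_t pmdval;
	pte_t *pte;

again:
	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, ptlp);
	if (!pte)
		return NULL;		/* no PTE table */
	spin_lock(*ptlp);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the PTE table was replaced or freed before we got the lock */
		pte_unmap_unlock(pte, *ptlp);
		goto again;
	}
	return pte;			/* PTL held, *pmd verified stable */
}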
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..f1d73fd448708 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
return false;
}
-static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
+ spinlock_t **ptlp)
{
pte_t ptent;
+ pmd_t pmdval;
if (pvmw->flags & PVMW_SYNC) {
/* Use the stricter lookup */
@@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
return !!pvmw->pte;
}
+again:
/*
* It is important to return the ptl corresponding to pte,
* in case *pvmw->pmd changes underneath us; so we need to
@@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
* proceeds to loop over next ptes, and finds a match later.
* Though, in most cases, page lock already protects this.
*/
- pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
- pvmw->address, ptlp);
+ pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, &pmdval, ptlp);
if (!pvmw->pte)
return false;
+ *pmdvalp = pmdval;
ptent = ptep_get(pvmw->pte);
@@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
}
pvmw->ptl = *ptlp;
spin_lock(pvmw->ptl);
+
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
+ spin_unlock(pvmw->ptl);
+ goto again;
+ }
+
return true;
}
@@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
- if (!map_pte(pvmw, &ptl)) {
+ if (!map_pte(pvmw, &pmde, &ptl)) {
if (!pvmw->pte)
goto restart;
goto next_pte;
@@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->ptl) {
pvmw->ptl = ptl;
spin_lock(pvmw->ptl);
+ if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
+ pte_unmap_unlock(pvmw->pte, pvmw->ptl);
+ pvmw->ptl = NULL;
+ pvmw->pte = NULL;
+ goto restart;
+ }
}
goto this_pte;
} while (pvmw->address < end);
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 11/14] mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (9 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-22 7:13 ` [PATCH v2 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
` (2 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to use pte_offset_map_rw_nolock(). But since we
already do the pte_same() check, there is no need to get the pmdval to do a
pmd_same() check; just pass a dummy variable to it.
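To illustrate why a dummy pmdval is sufficient here, a rough sketch
(the helper name is made up; the real code samples orig_src_pte and
orig_dst_pte and takes both PTLs, see the hunks below):

static bool example_map_and_verify(struct mm_struct *mm, pmd_t *pmd,
				   unsigned long addr, pte_t orig)
{
	pmd_t dummy_pmdval;	/* ignored: the pte_same() check below catches changes */
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &dummy_pmdval, &ptl);
	if (!pte)
		return false;		/* caller retries */
	spin_lock(ptl);
	if (!pte_same(ptep_get(pte), orig)) {
		pte_unmap_unlock(pte, ptl);
		return false;		/* entry changed under us: caller retries */
	}
	/* ... safe to modify the entry now ... */
	pte_unmap_unlock(pte, ptl);
	return true;
}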
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/userfaultfd.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6ef42d9cd482e..414ee744257b7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1135,7 +1135,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
spinlock_t *src_ptl, *dst_ptl;
pte_t *src_pte = NULL;
pte_t *dst_pte = NULL;
-
+ pmd_t dummy_pmdval;
struct folio *src_folio = NULL;
struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
@@ -1146,7 +1146,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
src_addr, src_addr + PAGE_SIZE);
mmu_notifier_invalidate_range_start(&range);
retry:
- dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+ /*
+ * Use the maywrite version to indicate that dst_pte will be modified,
+ * but since we will use pte_same() to detect the change of the pte
+ * entry, there is no need to get pmdval, so just pass a dummy variable
+ * to it.
+ */
+ dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval,
+ &dst_ptl);
/* Retry if a huge pmd materialized from under us */
if (unlikely(!dst_pte)) {
@@ -1154,7 +1161,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}
- src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+ /* same as dst_pte */
+ src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval,
+ &src_ptl);
/*
* We held the mmap_lock for reading so MADV_DONTNEED
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 12/14] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (10 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-22 7:13 ` [PATCH v2 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-08-22 7:13 ` [PATCH v2 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to use pte_offset_map_rw_nolock(). Here the write lock of
mmap_lock is not held, and no pte_same() check is performed after the ptl
is taken, so we should get the pmdval and do a pmd_same() check to ensure
the stability of the pmd entry.
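Unlike the retry loops elsewhere in this series, walk_pte_range() simply
gives up when the lock is contended or the pmd entry turned out to be
unstable. Condensed from the hunks below, with explanatory comments:

	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK,
				       &pmdval, &ptl);
	if (!pte)
		return false;		/* no PTE table */
	if (!spin_trylock(ptl)) {
		pte_unmap(pte);
		return false;		/* contended: don't block the walk */
	}
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		pte_unmap_unlock(pte, ptl);
		return false;		/* table replaced before we got the PTL */
	}
	/* ... walk the PTEs under ptl ... */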
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dc96a843466a..89ef1ac8eb1a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3396,8 +3396,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ pmd_t pmdval;
- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval,
+ &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
@@ -3405,6 +3407,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
return false;
}
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ pte_unmap_unlock(pte, ptl);
+ return false;
+ }
+
arch_enter_lazy_mmu_mode();
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 13/14] mm: pgtable: remove pte_offset_map_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (11 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
2024-08-22 7:13 ` [PATCH v2 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
Now that there are no users of pte_offset_map_nolock(), remove it.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Documentation/mm/split_page_table_lock.rst | 3 ---
include/linux/mm.h | 2 --
mm/pgtable-generic.c | 21 ---------------------
3 files changed, 26 deletions(-)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 08d0e706a32db..581446d4a4eba 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -16,9 +16,6 @@ There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock()
maps PTE and takes PTE table lock, returns pointer to PTE with
pointer to its PTE table lock, or returns NULL if no PTE table;
- - pte_offset_map_nolock()
- maps PTE, returns pointer to PTE with pointer to its PTE table
- lock (not taken), or returns NULL if no PTE table;
- pte_offset_map_ro_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a00cb35ce065f..1d3e8882f6ba1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2952,8 +2952,6 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, spinlock_t **ptlp);
pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp);
pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 9a1666574c959..3b6dd8bb5b134 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,18 +305,6 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
return NULL;
}
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, spinlock_t **ptlp)
-{
- pmd_t pmdval;
- pte_t *pte;
-
- pte = __pte_offset_map(pmd, addr, &pmdval);
- if (likely(pte))
- *ptlp = pte_lockptr(mm, &pmdval);
- return pte;
-}
-
pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp)
{
@@ -374,15 +362,6 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
* and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
* afterwards.
*
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it. This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns. In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
- *
* pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
* pte_offset_map(); but when successful, it also outputs a pointer to the
* spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-08-22 7:13 [PATCH v2 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
` (12 preceding siblings ...)
2024-08-22 7:13 ` [PATCH v2 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
@ 2024-08-22 7:13 ` Qi Zheng
13 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2024-08-22 7:13 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, christophe.leroy2
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is still stable.
Use pte_offset_map_rw_nolock() + pmd_same() to do this.
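The window that makes the pmd_same() check necessary is between mapping
the PTE table and taking pml/ptl. Condensed from the hunk below, with
explanatory comments:

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
	if (!pte)
		continue;			/* no PTE table here */

	pml = pmd_lock(mm, pmd);		/* pmd-level lock first */
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	/* did someone change or retract the table before we got the locks? */
	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
		pte_unmap_unlock(pte, ptl);
		if (ptl != pml)
			spin_unlock(pml);
		continue;
	}
	pte_unmap(pte);		/* keep both locks; the mapping itself is no longer needed */
	/* ... now safe to clear *pmd and free the PTE table ... */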
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15d3f7f3c65f2..799412041d262 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1723,6 +1723,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spinlock_t *pml;
spinlock_t *ptl;
bool skipped_uffd = false;
+ pte_t *pte;
/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1758,11 +1759,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
addr, addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
+ pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
+ if (!pte) {
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+
pml = pmd_lock(mm, pmd);
- ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ pte_unmap_unlock(pte, ptl);
+ if (ptl != pml)
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+ pte_unmap(pte);
+
/*
* Huge page lock is still held, so normally the page table
* must remain empty; and we have already skipped anon_vma
--
2.20.1
^ permalink raw reply related [flat|nested] 43+ messages in thread