[PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
       [not found] <20240725183955.2268884-1-david@redhat.com>
@ 2024-07-25 18:39 ` David Hildenbrand
  2024-07-26  2:33   ` Baolin Wang
                     ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-25 18:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Muchun Song, Peter Xu,
	Oscar Salvador, stable

We recently made GUP's common page table walking code to also walk
hugetlb VMAs without most hugetlb special-casing, preparing for the
future of having less hugetlb-specific page table walking code in the
codebase. Turns out that we missed one page table locking detail: page
table locking for hugetlb folios that are not mapped using a single
PMD/PUD.

Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
page tables, will perform a pte_offset_map_lock() to grab the PTE table
lock.

However, hugetlb that concurrently modifies these page tables would
actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
locks would differ. Something similar can happen right now with hugetlb
folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.

Let's make huge_pte_lockptr() effectively uses the same PT locks as any
core-mm page table walker would.

There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
folio being mapped using two PTE page tables. While hugetlb wants to take
the PMD table lock, core-mm would grab the PTE table lock of one of both
PTE page tables. In such corner cases, we have to make sure that both
locks match, which is (fortunately!) currently guaranteed for 8xx as it
does not support SMP.

Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a01..da800e56fe590 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
+	VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 */
+	if (huge_page_size(h) < PMD_SIZE)
+		return pte_lockptr(mm, pte);
+	else if (huge_page_size(h) < PUD_SIZE)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
-	return &mm->page_table_lock;
+	return pud_lockptr(mm, (pud_t *) pte);
 }

 #ifndef hugepages_supported
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
@ 2024-07-26  2:33   ` Baolin Wang
  2024-07-26  3:03     ` Baolin Wang
  2024-07-26  8:04     ` David Hildenbrand
  2024-07-26  8:18   ` Muchun Song
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Baolin Wang @ 2024-07-26  2:33 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable



On 2024/7/26 02:39, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
> 
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
> 
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
> 
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.

Thanks for raising the issue again. I remember fixing this issue 2 years 
ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a 
CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.

> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>   1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c9bf68c239a01..da800e56fe590 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>   static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>   					   struct mm_struct *mm, pte_t *pte)
>   {
> -	if (huge_page_size(h) == PMD_SIZE)
> +	VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
> +	VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
> +
> +	/*
> +	 * hugetlb must use the exact same PT locks as core-mm page table
> +	 * walkers would. When modifying a PTE table, hugetlb must take the
> +	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
> +	 * PT lock etc.
> +	 *
> +	 * The expectation is that any hugetlb folio smaller than a PMD is
> +	 * always mapped into a single PTE table and that any hugetlb folio
> +	 * smaller than a PUD (but at least as big as a PMD) is always mapped
> +	 * into a single PMD table.

ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4 
KiB base page size. This means the PT locks for 32MiB hugetlb may race 
again, as we currently only hold one PMD lock for several PMD entries of 
a cont-PMD size hugetlb.

> +	 *
> +	 * If that does not hold for an architecture, then that architecture
> +	 * must disable split PT locks such that all *_lockptr() functions
> +	 * will give us the same result: the per-MM PT lock.
> +	 */
> +	if (huge_page_size(h) < PMD_SIZE)
> +		return pte_lockptr(mm, pte);
> +	else if (huge_page_size(h) < PUD_SIZE)
>   		return pmd_lockptr(mm, (pmd_t *) pte);

IIUC, as I said above, this change doesn't fix the inconsistent lock for 
cont-PMD size hugetlb for GUP, and it will also break the lock rule for 
unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock 
before for cont-PMD size hugetlb before).

> -	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> -	return &mm->page_table_lock;
> +	return pud_lockptr(mm, (pud_t *) pte);
>   }
>   
>   #ifndef hugepages_supported

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26  2:33   ` Baolin Wang
@ 2024-07-26  3:03     ` Baolin Wang
  2024-07-26  8:04       ` David Hildenbrand
  2024-07-26  8:04     ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2024-07-26  3:03 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable



On 2024/7/26 10:33, Baolin Wang wrote:
> 
> 
> On 2024/7/26 02:39, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
> 
> Thanks for raising the issue again. I remember fixing this issue 2 years 
> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a 
> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
> 
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic 
>> follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c9bf68c239a01..da800e56fe590 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int 
>> reason)
>>   static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>>                          struct mm_struct *mm, pte_t *pte)
>>   {
>> -    if (huge_page_size(h) == PMD_SIZE)
>> +    VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
>> +    VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
>> +
>> +    /*
>> +     * hugetlb must use the exact same PT locks as core-mm page table
>> +     * walkers would. When modifying a PTE table, hugetlb must take the
>> +     * PTE PT lock, when modifying a PMD table, hugetlb must take the 
>> PMD
>> +     * PT lock etc.
>> +     *
>> +     * The expectation is that any hugetlb folio smaller than a PMD is
>> +     * always mapped into a single PTE table and that any hugetlb folio
>> +     * smaller than a PUD (but at least as big as a PMD) is always 
>> mapped
>> +     * into a single PMD table.
> 
> ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4 
> KiB base page size. This means the PT locks for 32MiB hugetlb may race 
> again, as we currently only hold one PMD lock for several PMD entries of 
> a cont-PMD size hugetlb.
> 
>> +     *
>> +     * If that does not hold for an architecture, then that architecture
>> +     * must disable split PT locks such that all *_lockptr() functions
>> +     * will give us the same result: the per-MM PT lock.
>> +     */
>> +    if (huge_page_size(h) < PMD_SIZE)
>> +        return pte_lockptr(mm, pte);
>> +    else if (huge_page_size(h) < PUD_SIZE)
>>           return pmd_lockptr(mm, (pmd_t *) pte);
> 
> IIUC, as I said above, this change doesn't fix the inconsistent lock for 
> cont-PMD size hugetlb for GUP, and it will also break the lock rule for 
> unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock 
> before for cont-PMD size hugetlb before).

After more thinking, I realized I confused the PMD table with the PMD 
entry. Therefore, using the PMD table's lock is safe for cont-PMD size 
hugetlb. This change looks good to me. Sorry for noise.

Please feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> 
>> -    VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
>> -    return &mm->page_table_lock;
>> +    return pud_lockptr(mm, (pud_t *) pte);
>>   }
>>   #ifndef hugepages_supported

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26  3:03     ` Baolin Wang
@ 2024-07-26  8:04       ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26  8:04 UTC (permalink / raw)
  To: Baolin Wang, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable

>>
>>> +     *
>>> +     * If that does not hold for an architecture, then that architecture
>>> +     * must disable split PT locks such that all *_lockptr() functions
>>> +     * will give us the same result: the per-MM PT lock.
>>> +     */
>>> +    if (huge_page_size(h) < PMD_SIZE)
>>> +        return pte_lockptr(mm, pte);
>>> +    else if (huge_page_size(h) < PUD_SIZE)
>>>            return pmd_lockptr(mm, (pmd_t *) pte);
>>
>> IIUC, as I said above, this change doesn't fix the inconsistent lock for
>> cont-PMD size hugetlb for GUP, and it will also break the lock rule for
>> unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock
>> before for cont-PMD size hugetlb before).
> 
> After more thinking, I realized I confused the PMD table with the PMD
> entry. Therefore, using the PMD table's lock is safe for cont-PMD size
> hugetlb. This change looks good to me. Sorry for noise.
> 

Thanks for the review, highly appreciated!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26  2:33   ` Baolin Wang
  2024-07-26  3:03     ` Baolin Wang
@ 2024-07-26  8:04     ` David Hildenbrand
  2024-07-26  9:38       ` Baolin Wang
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26  8:04 UTC (permalink / raw)
  To: Baolin Wang, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable

On 26.07.24 04:33, Baolin Wang wrote:
> 
> 
> On 2024/7/26 02:39, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
> 
> Thanks for raising the issue again. I remember fixing this issue 2 years
> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
> 

Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D

Did we have a reproducer back then that would make my live easier?

>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>>    1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c9bf68c239a01..da800e56fe590 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>>    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>>    					   struct mm_struct *mm, pte_t *pte)
>>    {
>> -	if (huge_page_size(h) == PMD_SIZE)
>> +	VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
>> +	VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
>> +
>> +	/*
>> +	 * hugetlb must use the exact same PT locks as core-mm page table
>> +	 * walkers would. When modifying a PTE table, hugetlb must take the
>> +	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
>> +	 * PT lock etc.
>> +	 *
>> +	 * The expectation is that any hugetlb folio smaller than a PMD is
>> +	 * always mapped into a single PTE table and that any hugetlb folio
>> +	 * smaller than a PUD (but at least as big as a PMD) is always mapped
>> +	 * into a single PMD table.
> 
> ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4
> KiB base page size. This means the PT locks for 32MiB hugetlb may race
> again, as we currently only hold one PMD lock for several PMD entries of
> a cont-PMD size hugetlb.

Exactly, that's the case where all cont-PMD entries fall into the same 
page table.

That's also what I will try reproducing with (migration racing with 
GUP). But the race window is small and a lot of other stuff is protected 
by the VMA lock.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26  8:04     ` David Hildenbrand
@ 2024-07-26  9:38       ` Baolin Wang
  2024-07-26 11:40         ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2024-07-26  9:38 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable



On 2024/7/26 16:04, David Hildenbrand wrote:
> On 26.07.24 04:33, Baolin Wang wrote:
>>
>>
>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>> We recently made GUP's common page table walking code to also walk
>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>> future of having less hugetlb-specific page table walking code in the
>>> codebase. Turns out that we missed one page table locking detail: page
>>> table locking for hugetlb folios that are not mapped using a single
>>> PMD/PUD.
>>>
>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>>> lock.
>>>
>>> However, hugetlb that concurrently modifies these page tables would
>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>> locks would differ. Something similar can happen right now with hugetlb
>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>
>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>>> core-mm page table walker would.
>>
>> Thanks for raising the issue again. I remember fixing this issue 2 years
>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>
> 
> Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D
> 
> Did we have a reproducer back then that would make my live easier?

I don't have any reproducers right now. I remember I added some ugly 
hack code (adding delay() etc.) in kernel to analyze this issue, and not 
easy to reproduce. :(

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26  9:38       ` Baolin Wang
@ 2024-07-26 11:40         ` David Hildenbrand
  2024-07-29  1:48           ` Baolin Wang
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 11:40 UTC (permalink / raw)
  To: Baolin Wang, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable

On 26.07.24 11:38, Baolin Wang wrote:
> 
> 
> On 2024/7/26 16:04, David Hildenbrand wrote:
>> On 26.07.24 04:33, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>>> We recently made GUP's common page table walking code to also walk
>>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>>> future of having less hugetlb-specific page table walking code in the
>>>> codebase. Turns out that we missed one page table locking detail: page
>>>> table locking for hugetlb folios that are not mapped using a single
>>>> PMD/PUD.
>>>>
>>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>>>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>>>> lock.
>>>>
>>>> However, hugetlb that concurrently modifies these page tables would
>>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>>> locks would differ. Something similar can happen right now with hugetlb
>>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>>
>>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>>>> core-mm page table walker would.
>>>
>>> Thanks for raising the issue again. I remember fixing this issue 2 years
>>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>>
>>
>> Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D
>>
>> Did we have a reproducer back then that would make my live easier?
> 
> I don't have any reproducers right now. I remember I added some ugly
> hack code (adding delay() etc.) in kernel to analyze this issue, and not
> easy to reproduce. :(

Hah!

I tried with 32MB without luck -- migration simply takes too long. 64KB
did the trick within seconds!


On a VM with 2 vNUMA nodes, after reserving a bunch of 64KiB hugetlb pages:

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64439 MB
node 0 free: 64097 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 64205 MB
node 1 free: 63809 MB
node distances:
node   0   1
   0:  10  20
   1:  20  10
# echo 100  > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepagespages-64kB/nr_hugepagesepages-64kB/nr_hugepages
# gcc reproducer.c -o reproducer -O3 -lnuma -lpthread
# ./reproducer
[ 3105.936100] ------------[ cut here ]------------
[ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
[ 3105.944634] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat ext4 mbcache jbd2 virtio_net net_failover failover virtio_balloon dimlib loop dm_multipath nfnetlink zram xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_console virtio_mmio dm_mirror dm_region_hash dm_log dm_mod fuse
[ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
[ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
[ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3105.991108] pc : try_grab_folio+0x11c/0x188
[ 3105.994013] lr : follow_page_pte+0xd8/0x430
[ 3105.996986] sp : ffff80008eafb8f0
[ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
[ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
[ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
[ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
[ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
[ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
[ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
[ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
[ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
[ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
[ 3106.047957] Call trace:
[ 3106.049522]  try_grab_folio+0x11c/0x188
[ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
[ 3106.055527]  follow_page_mask+0x1a0/0x2b8
[ 3106.058118]  __get_user_pages+0xf0/0x348
[ 3106.060647]  faultin_page_range+0xb0/0x360
[ 3106.063651]  do_madvise+0x340/0x598
...



# cat reproducer.c
#include <stdio.h>
#include <sys/mman.h>
#include <linux/mman.h>
#include <errno.h>
#include <string.h>
#include <numaif.h>
#include <pthread.h>

#define SIZE_64KB (64 * 1024ul)

static void *thread_fn(void *arg)
{
         char *mem = arg;

         /* Let GUP go crazy on the page without grabbing a reference. */
         while (1) {
                 if (madvise(mem, SIZE_64KB, MADV_POPULATE_WRITE)) {
                         fprintf(stderr, "madvise() failed: %d\n", errno);
                 }
         }
}

int main(void)
{
         pthread_t thread;
         unsigned int i;
         char *mem;

         mem = mmap(0, SIZE_64KB, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE|MAP_HUGETLB|MAP_HUGE_64KB, -1, 0);
         if (mem == MAP_FAILED) {
                 fprintf(stderr, "mmap() failed: %d\n", errno);
                 return -1;
         }

         memset(mem, 0, SIZE_64KB);

         pthread_create(&thread, NULL, thread_fn, mem);

         /* Migrate it back and forth between two nodes. */
         for (i = 0; ; i++) {
                 void *pages[1] = { mem, };
                 int nodes[1] = { i % 2, };
                 int status[1] = { 0 };

                 if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE_ALL))
                         fprintf(stderr, "move_pages failed: %d\n", errno);
                 if (status[0] != nodes[0])
                         printf("migration failed\n");
         }
         return 0;
}


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26 11:40         ` David Hildenbrand
@ 2024-07-29  1:48           ` Baolin Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Baolin Wang @ 2024-07-29  1:48 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
	stable



On 2024/7/26 19:40, David Hildenbrand wrote:
> On 26.07.24 11:38, Baolin Wang wrote:
>>
>>
>> On 2024/7/26 16:04, David Hildenbrand wrote:
>>> On 26.07.24 04:33, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>>>> We recently made GUP's common page table walking code to also walk
>>>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>>>> future of having less hugetlb-specific page table walking code in the
>>>>> codebase. Turns out that we missed one page table locking detail: page
>>>>> table locking for hugetlb folios that are not mapped using a single
>>>>> PMD/PUD.
>>>>>
>>>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it 
>>>>> walks the
>>>>> page tables, will perform a pte_offset_map_lock() to grab the PTE 
>>>>> table
>>>>> lock.
>>>>>
>>>>> However, hugetlb that concurrently modifies these page tables would
>>>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>>>> locks would differ. Something similar can happen right now with 
>>>>> hugetlb
>>>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>>>
>>>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as 
>>>>> any
>>>>> core-mm page table walker would.
>>>>
>>>> Thanks for raising the issue again. I remember fixing this issue 2 
>>>> years
>>>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>>>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>>>
>>>
>>> Ah, right! We fixed it by rerouting to hugetlb code that we then 
>>> removed :D
>>>
>>> Did we have a reproducer back then that would make my live easier?
>>
>> I don't have any reproducers right now. I remember I added some ugly
>> hack code (adding delay() etc.) in kernel to analyze this issue, and not
>> easy to reproduce. :(
> 
> Hah!
> 
> I tried with 32MB without luck -- migration simply takes too long. 64KB
> did the trick within seconds!
> 
> 
> On a VM with 2 vNUMA nodes, after reserving a bunch of 64KiB hugetlb pages:
> 
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 64439 MB
> node 0 free: 64097 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 size: 64205 MB
> node 1 free: 63809 MB
> node distances:
> node   0   1
>    0:  10  20
>    1:  20  10
> # echo 100  > 
> /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepagespages-64kB/nr_hugepagesepages-64kB/nr_hugepages
> # gcc reproducer.c -o reproducer -O3 -lnuma -lpthread
> # ./reproducer
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 
> try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill 
> ip_set nf_tables qrtr sunrpc vfat fat ext4 mbcache jbd2 virtio_net 
> net_failover failover virtio_balloon dimlib loop dm_multipath nfnetlink 
> zram xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk 
> virtio_console virtio_mmio dm_mirror dm_region_hash dm_log dm_mod fuse
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 
> 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS 
> edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS 
> BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 
> 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: 
> ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: 
> ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 
> 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 
> 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 
> 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : 
> ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 
> 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : 
> ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 
> 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522]  try_grab_folio+0x11c/0x188
> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
> [ 3106.058118]  __get_user_pages+0xf0/0x348
> [ 3106.060647]  faultin_page_range+0xb0/0x360
> [ 3106.063651]  do_madvise+0x340/0x598
> ...
> 
> 
> 
> # cat reproducer.c
> #include <stdio.h>
> #include <sys/mman.h>
> #include <linux/mman.h>
> #include <errno.h>
> #include <string.h>
> #include <numaif.h>
> #include <pthread.h>
> 
> #define SIZE_64KB (64 * 1024ul)
> 
> static void *thread_fn(void *arg)
> {
>          char *mem = arg;
> 
>          /* Let GUP go crazy on the page without grabbing a reference. */
>          while (1) {
>                  if (madvise(mem, SIZE_64KB, MADV_POPULATE_WRITE)) {
>                          fprintf(stderr, "madvise() failed: %d\n", errno);
>                  }
>          }
> }
> 
> int main(void)
> {
>          pthread_t thread;
>          unsigned int i;
>          char *mem;
> 
>          mem = mmap(0, SIZE_64KB, PROT_READ|PROT_WRITE,
>                     MAP_ANON|MAP_PRIVATE|MAP_HUGETLB|MAP_HUGE_64KB, -1, 0);
>          if (mem == MAP_FAILED) {
>                  fprintf(stderr, "mmap() failed: %d\n", errno);
>                  return -1;
>          }
> 
>          memset(mem, 0, SIZE_64KB);
> 
>          pthread_create(&thread, NULL, thread_fn, mem);
> 
>          /* Migrate it back and forth between two nodes. */
>          for (i = 0; ; i++) {
>                  void *pages[1] = { mem, };
>                  int nodes[1] = { i % 2, };
>                  int status[1] = { 0 };
> 
>                  if (move_pages(0, 1, pages, nodes, status, 
> MPOL_MF_MOVE_ALL))
>                          fprintf(stderr, "move_pages failed: %d\n", errno);
>                  if (status[0] != nodes[0])
>                          printf("migration failed\n");
>          }
>          return 0;
> }

Great! Thanks for creating the reproducer (I will also try it if I find 
some time).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
  2024-07-26  2:33   ` Baolin Wang
@ 2024-07-26  8:18   ` Muchun Song
  2024-07-26 15:26   ` Peter Xu
  2024-07-29  4:51   ` Oscar Salvador
  3 siblings, 0 replies; 12+ messages in thread
From: Muchun Song @ 2024-07-26  8:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LKML, Linux Memory Management List, Andrew Morton, Peter Xu,
	Oscar Salvador, stable



> On Jul 26, 2024, at 02:39, David Hildenbrand <david@redhat.com> wrote:
> 
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
> 
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
> 
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
> 
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
> 
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
  2024-07-26  2:33   ` Baolin Wang
  2024-07-26  8:18   ` Muchun Song
@ 2024-07-26 15:26   ` Peter Xu
  2024-07-26 15:32     ` David Hildenbrand
  2024-07-29  4:51   ` Oscar Salvador
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-07-26 15:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song,
	Oscar Salvador, stable

On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
> 
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
> 
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
> 
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
> 
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.

Do you mean "does not support SPLIT_PMD_PTLOCK" here instead of SMP?

> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Patch looks all right to me:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-26 15:26   ` Peter Xu
@ 2024-07-26 15:32     ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 15:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song,
	Oscar Salvador, stable

On 26.07.24 17:26, Peter Xu wrote:
> On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
>>
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
> 
> Do you mean "does not support SPLIT_PMD_PTLOCK" here instead of SMP?

Split PT locks are only enabled once we have NR_CPUS >= 4, so without 
SMP (NR_CPUS == 1), no split PT locks are used. I document that in the 
other series that I just sent out in a better way.

> 
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Patch looks all right to me:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Thanks!

Thanks Peter!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
  2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
                     ` (2 preceding siblings ...)
  2024-07-26 15:26   ` Peter Xu
@ 2024-07-29  4:51   ` Oscar Salvador
  3 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2024-07-29  4:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song, Peter Xu,
	stable

On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
> 
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
> 
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
> 
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
> 
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-07-29  4:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240725183955.2268884-1-david@redhat.com>
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
2024-07-26  2:33   ` Baolin Wang
2024-07-26  3:03     ` Baolin Wang
2024-07-26  8:04       ` David Hildenbrand
2024-07-26  8:04     ` David Hildenbrand
2024-07-26  9:38       ` Baolin Wang
2024-07-26 11:40         ` David Hildenbrand
2024-07-29  1:48           ` Baolin Wang
2024-07-26  8:18   ` Muchun Song
2024-07-26 15:26   ` Peter Xu
2024-07-26 15:32     ` David Hildenbrand
2024-07-29  4:51   ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox