* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
@ 2024-07-26 2:33 ` Baolin Wang
2024-07-26 3:03 ` Baolin Wang
2024-07-26 8:04 ` David Hildenbrand
2024-07-26 8:18 ` Muchun Song
` (2 subsequent siblings)
3 siblings, 2 replies; 12+ messages in thread
From: Baolin Wang @ 2024-07-26 2:33 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 2024/7/26 02:39, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
>
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
Thanks for raising the issue again. I remember fixing this issue 2 years
ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c9bf68c239a01..da800e56fe590 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> struct mm_struct *mm, pte_t *pte)
> {
> - if (huge_page_size(h) == PMD_SIZE)
> + VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
> + VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
> +
> + /*
> + * hugetlb must use the exact same PT locks as core-mm page table
> + * walkers would. When modifying a PTE table, hugetlb must take the
> + * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
> + * PT lock etc.
> + *
> + * The expectation is that any hugetlb folio smaller than a PMD is
> + * always mapped into a single PTE table and that any hugetlb folio
> + * smaller than a PUD (but at least as big as a PMD) is always mapped
> + * into a single PMD table.
ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4
KiB base page size. This means the PT locks for 32MiB hugetlb may race
again, as we currently only hold one PMD lock for several PMD entries of
a cont-PMD size hugetlb.
> + *
> + * If that does not hold for an architecture, then that architecture
> + * must disable split PT locks such that all *_lockptr() functions
> + * will give us the same result: the per-MM PT lock.
> + */
> + if (huge_page_size(h) < PMD_SIZE)
> + return pte_lockptr(mm, pte);
> + else if (huge_page_size(h) < PUD_SIZE)
> return pmd_lockptr(mm, (pmd_t *) pte);
IIUC, as I said above, this change doesn't fix the inconsistent lock for
cont-PMD size hugetlb for GUP, and it will also break the lock rule for
unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock
before for cont-PMD size hugetlb before).
> - VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> - return &mm->page_table_lock;
> + return pud_lockptr(mm, (pud_t *) pte);
> }
>
> #ifndef hugepages_supported
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 2:33 ` Baolin Wang
@ 2024-07-26 3:03 ` Baolin Wang
2024-07-26 8:04 ` David Hildenbrand
2024-07-26 8:04 ` David Hildenbrand
1 sibling, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2024-07-26 3:03 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 2024/7/26 10:33, Baolin Wang wrote:
>
>
> On 2024/7/26 02:39, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
>
> Thanks for raising the issue again. I remember fixing this issue 2 years
> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic
>> follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c9bf68c239a01..da800e56fe590 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int
>> reason)
>> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>> struct mm_struct *mm, pte_t *pte)
>> {
>> - if (huge_page_size(h) == PMD_SIZE)
>> + VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
>> + VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
>> +
>> + /*
>> + * hugetlb must use the exact same PT locks as core-mm page table
>> + * walkers would. When modifying a PTE table, hugetlb must take the
>> + * PTE PT lock, when modifying a PMD table, hugetlb must take the
>> PMD
>> + * PT lock etc.
>> + *
>> + * The expectation is that any hugetlb folio smaller than a PMD is
>> + * always mapped into a single PTE table and that any hugetlb folio
>> + * smaller than a PUD (but at least as big as a PMD) is always
>> mapped
>> + * into a single PMD table.
>
> ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4
> KiB base page size. This means the PT locks for 32MiB hugetlb may race
> again, as we currently only hold one PMD lock for several PMD entries of
> a cont-PMD size hugetlb.
>
>> + *
>> + * If that does not hold for an architecture, then that architecture
>> + * must disable split PT locks such that all *_lockptr() functions
>> + * will give us the same result: the per-MM PT lock.
>> + */
>> + if (huge_page_size(h) < PMD_SIZE)
>> + return pte_lockptr(mm, pte);
>> + else if (huge_page_size(h) < PUD_SIZE)
>> return pmd_lockptr(mm, (pmd_t *) pte);
>
> IIUC, as I said above, this change doesn't fix the inconsistent lock for
> cont-PMD size hugetlb for GUP, and it will also break the lock rule for
> unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock
> before for cont-PMD size hugetlb before).
After more thinking, I realized I confused the PMD table with the PMD
entry. Therefore, using the PMD table's lock is safe for cont-PMD size
hugetlb. This change looks good to me. Sorry for noise.
Please feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
>> - VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
>> - return &mm->page_table_lock;
>> + return pud_lockptr(mm, (pud_t *) pte);
>> }
>> #ifndef hugepages_supported
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 3:03 ` Baolin Wang
@ 2024-07-26 8:04 ` David Hildenbrand
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 8:04 UTC (permalink / raw)
To: Baolin Wang, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
>>
>>> + *
>>> + * If that does not hold for an architecture, then that architecture
>>> + * must disable split PT locks such that all *_lockptr() functions
>>> + * will give us the same result: the per-MM PT lock.
>>> + */
>>> + if (huge_page_size(h) < PMD_SIZE)
>>> + return pte_lockptr(mm, pte);
>>> + else if (huge_page_size(h) < PUD_SIZE)
>>> return pmd_lockptr(mm, (pmd_t *) pte);
>>
>> IIUC, as I said above, this change doesn't fix the inconsistent lock for
>> cont-PMD size hugetlb for GUP, and it will also break the lock rule for
>> unmapping/migrating a cont-PMD size hugetlb (use mm->page_table_lock
>> before for cont-PMD size hugetlb before).
>
> After more thinking, I realized I confused the PMD table with the PMD
> entry. Therefore, using the PMD table's lock is safe for cont-PMD size
> hugetlb. This change looks good to me. Sorry for noise.
>
Thanks for the review, highly appreciated!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 2:33 ` Baolin Wang
2024-07-26 3:03 ` Baolin Wang
@ 2024-07-26 8:04 ` David Hildenbrand
2024-07-26 9:38 ` Baolin Wang
1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 8:04 UTC (permalink / raw)
To: Baolin Wang, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 26.07.24 04:33, Baolin Wang wrote:
>
>
> On 2024/7/26 02:39, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
>
> Thanks for raising the issue again. I remember fixing this issue 2 years
> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>
Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D
Did we have a reproducer back then that would make my live easier?
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c9bf68c239a01..da800e56fe590 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>> struct mm_struct *mm, pte_t *pte)
>> {
>> - if (huge_page_size(h) == PMD_SIZE)
>> + VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
>> + VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
>> +
>> + /*
>> + * hugetlb must use the exact same PT locks as core-mm page table
>> + * walkers would. When modifying a PTE table, hugetlb must take the
>> + * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
>> + * PT lock etc.
>> + *
>> + * The expectation is that any hugetlb folio smaller than a PMD is
>> + * always mapped into a single PTE table and that any hugetlb folio
>> + * smaller than a PUD (but at least as big as a PMD) is always mapped
>> + * into a single PMD table.
>
> ARM64 also supports cont-PMD size hugetlb, which is 32MiB size with a 4
> KiB base page size. This means the PT locks for 32MiB hugetlb may race
> again, as we currently only hold one PMD lock for several PMD entries of
> a cont-PMD size hugetlb.
Exactly, that's the case where all cont-PMD entries fall into the same
page table.
That's also what I will try reproducing with (migration racing with
GUP). But the race window is small and a lot of other stuff is protected
by the VMA lock.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 8:04 ` David Hildenbrand
@ 2024-07-26 9:38 ` Baolin Wang
2024-07-26 11:40 ` David Hildenbrand
0 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2024-07-26 9:38 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 2024/7/26 16:04, David Hildenbrand wrote:
> On 26.07.24 04:33, Baolin Wang wrote:
>>
>>
>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>> We recently made GUP's common page table walking code to also walk
>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>> future of having less hugetlb-specific page table walking code in the
>>> codebase. Turns out that we missed one page table locking detail: page
>>> table locking for hugetlb folios that are not mapped using a single
>>> PMD/PUD.
>>>
>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>>> lock.
>>>
>>> However, hugetlb that concurrently modifies these page tables would
>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>> locks would differ. Something similar can happen right now with hugetlb
>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>
>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>>> core-mm page table walker would.
>>
>> Thanks for raising the issue again. I remember fixing this issue 2 years
>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>
>
> Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D
>
> Did we have a reproducer back then that would make my live easier?
I don't have any reproducers right now. I remember I added some ugly
hack code (adding delay() etc.) in kernel to analyze this issue, and not
easy to reproduce. :(
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 9:38 ` Baolin Wang
@ 2024-07-26 11:40 ` David Hildenbrand
2024-07-29 1:48 ` Baolin Wang
0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 11:40 UTC (permalink / raw)
To: Baolin Wang, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 26.07.24 11:38, Baolin Wang wrote:
>
>
> On 2024/7/26 16:04, David Hildenbrand wrote:
>> On 26.07.24 04:33, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>>> We recently made GUP's common page table walking code to also walk
>>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>>> future of having less hugetlb-specific page table walking code in the
>>>> codebase. Turns out that we missed one page table locking detail: page
>>>> table locking for hugetlb folios that are not mapped using a single
>>>> PMD/PUD.
>>>>
>>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>>>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>>>> lock.
>>>>
>>>> However, hugetlb that concurrently modifies these page tables would
>>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>>> locks would differ. Something similar can happen right now with hugetlb
>>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>>
>>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>>>> core-mm page table walker would.
>>>
>>> Thanks for raising the issue again. I remember fixing this issue 2 years
>>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>>
>>
>> Ah, right! We fixed it by rerouting to hugetlb code that we then removed :D
>>
>> Did we have a reproducer back then that would make my live easier?
>
> I don't have any reproducers right now. I remember I added some ugly
> hack code (adding delay() etc.) in kernel to analyze this issue, and not
> easy to reproduce. :(
Hah!
I tried with 32MB without luck -- migration simply takes too long. 64KB
did the trick within seconds!
On a VM with 2 vNUMA nodes, after reserving a bunch of 64KiB hugetlb pages:
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64439 MB
node 0 free: 64097 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 64205 MB
node 1 free: 63809 MB
node distances:
node 0 1
0: 10 20
1: 20 10
# echo 100 > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepagespages-64kB/nr_hugepagesepages-64kB/nr_hugepages
# gcc reproducer.c -o reproducer -O3 -lnuma -lpthread
# ./reproducer
[ 3105.936100] ------------[ cut here ]------------
[ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
[ 3105.944634] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat ext4 mbcache jbd2 virtio_net net_failover failover virtio_balloon dimlib loop dm_multipath nfnetlink zram xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_console virtio_mmio dm_mirror dm_region_hash dm_log dm_mod fuse
[ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
[ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
[ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3105.991108] pc : try_grab_folio+0x11c/0x188
[ 3105.994013] lr : follow_page_pte+0xd8/0x430
[ 3105.996986] sp : ffff80008eafb8f0
[ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
[ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
[ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
[ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
[ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
[ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
[ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
[ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
[ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
[ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
[ 3106.047957] Call trace:
[ 3106.049522] try_grab_folio+0x11c/0x188
[ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
[ 3106.055527] follow_page_mask+0x1a0/0x2b8
[ 3106.058118] __get_user_pages+0xf0/0x348
[ 3106.060647] faultin_page_range+0xb0/0x360
[ 3106.063651] do_madvise+0x340/0x598
...
# cat reproducer.c
#include <stdio.h>
#include <sys/mman.h>
#include <linux/mman.h>
#include <errno.h>
#include <string.h>
#include <numaif.h>
#include <pthread.h>
#define SIZE_64KB (64 * 1024ul)
static void *thread_fn(void *arg)
{
char *mem = arg;
/* Let GUP go crazy on the page without grabbing a reference. */
while (1) {
if (madvise(mem, SIZE_64KB, MADV_POPULATE_WRITE)) {
fprintf(stderr, "madvise() failed: %d\n", errno);
}
}
}
int main(void)
{
pthread_t thread;
unsigned int i;
char *mem;
mem = mmap(0, SIZE_64KB, PROT_READ|PROT_WRITE,
MAP_ANON|MAP_PRIVATE|MAP_HUGETLB|MAP_HUGE_64KB, -1, 0);
if (mem == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %d\n", errno);
return -1;
}
memset(mem, 0, SIZE_64KB);
pthread_create(&thread, NULL, thread_fn, mem);
/* Migrate it back and forth between two nodes. */
for (i = 0; ; i++) {
void *pages[1] = { mem, };
int nodes[1] = { i % 2, };
int status[1] = { 0 };
if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE_ALL))
fprintf(stderr, "move_pages failed: %d\n", errno);
if (status[0] != nodes[0])
printf("migration failed\n");
}
return 0;
}
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 11:40 ` David Hildenbrand
@ 2024-07-29 1:48 ` Baolin Wang
0 siblings, 0 replies; 12+ messages in thread
From: Baolin Wang @ 2024-07-29 1:48 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador,
stable
On 2024/7/26 19:40, David Hildenbrand wrote:
> On 26.07.24 11:38, Baolin Wang wrote:
>>
>>
>> On 2024/7/26 16:04, David Hildenbrand wrote:
>>> On 26.07.24 04:33, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>>>> We recently made GUP's common page table walking code to also walk
>>>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>>>> future of having less hugetlb-specific page table walking code in the
>>>>> codebase. Turns out that we missed one page table locking detail: page
>>>>> table locking for hugetlb folios that are not mapped using a single
>>>>> PMD/PUD.
>>>>>
>>>>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it
>>>>> walks the
>>>>> page tables, will perform a pte_offset_map_lock() to grab the PTE
>>>>> table
>>>>> lock.
>>>>>
>>>>> However, hugetlb that concurrently modifies these page tables would
>>>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>>>> locks would differ. Something similar can happen right now with
>>>>> hugetlb
>>>>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>>>
>>>>> Let's make huge_pte_lockptr() effectively uses the same PT locks as
>>>>> any
>>>>> core-mm page table walker would.
>>>>
>>>> Thanks for raising the issue again. I remember fixing this issue 2
>>>> years
>>>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>>>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>>>
>>>
>>> Ah, right! We fixed it by rerouting to hugetlb code that we then
>>> removed :D
>>>
>>> Did we have a reproducer back then that would make my live easier?
>>
>> I don't have any reproducers right now. I remember I added some ugly
>> hack code (adding delay() etc.) in kernel to analyze this issue, and not
>> easy to reproduce. :(
>
> Hah!
>
> I tried with 32MB without luck -- migration simply takes too long. 64KB
> did the trick within seconds!
>
>
> On a VM with 2 vNUMA nodes, after reserving a bunch of 64KiB hugetlb pages:
>
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 64439 MB
> node 0 free: 64097 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 size: 64205 MB
> node 1 free: 63809 MB
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
> # echo 100 >
> /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepagespages-64kB/nr_hugepagesepages-64kB/nr_hugepages
> # gcc reproducer.c -o reproducer -O3 -lnuma -lpthread
> # ./reproducer
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142
> try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
> ip_set nf_tables qrtr sunrpc vfat fat ext4 mbcache jbd2 virtio_net
> net_failover failover virtio_balloon dimlib loop dm_multipath nfnetlink
> zram xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk
> virtio_console virtio_mmio dm_mirror dm_region_hash dm_log dm_mod fuse
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted
> 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS
> edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27:
> 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24:
> ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21:
> ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18:
> 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15:
> 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12:
> 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 :
> ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 :
> 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 :
> ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 :
> 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522] try_grab_folio+0x11c/0x188
> [ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527] follow_page_mask+0x1a0/0x2b8
> [ 3106.058118] __get_user_pages+0xf0/0x348
> [ 3106.060647] faultin_page_range+0xb0/0x360
> [ 3106.063651] do_madvise+0x340/0x598
> ...
>
>
>
> # cat reproducer.c
> #include <stdio.h>
> #include <sys/mman.h>
> #include <linux/mman.h>
> #include <errno.h>
> #include <string.h>
> #include <numaif.h>
> #include <pthread.h>
>
> #define SIZE_64KB (64 * 1024ul)
>
> static void *thread_fn(void *arg)
> {
> char *mem = arg;
>
> /* Let GUP go crazy on the page without grabbing a reference. */
> while (1) {
> if (madvise(mem, SIZE_64KB, MADV_POPULATE_WRITE)) {
> fprintf(stderr, "madvise() failed: %d\n", errno);
> }
> }
> }
>
> int main(void)
> {
> pthread_t thread;
> unsigned int i;
> char *mem;
>
> mem = mmap(0, SIZE_64KB, PROT_READ|PROT_WRITE,
> MAP_ANON|MAP_PRIVATE|MAP_HUGETLB|MAP_HUGE_64KB, -1, 0);
> if (mem == MAP_FAILED) {
> fprintf(stderr, "mmap() failed: %d\n", errno);
> return -1;
> }
>
> memset(mem, 0, SIZE_64KB);
>
> pthread_create(&thread, NULL, thread_fn, mem);
>
> /* Migrate it back and forth between two nodes. */
> for (i = 0; ; i++) {
> void *pages[1] = { mem, };
> int nodes[1] = { i % 2, };
> int status[1] = { 0 };
>
> if (move_pages(0, 1, pages, nodes, status,
> MPOL_MF_MOVE_ALL))
> fprintf(stderr, "move_pages failed: %d\n", errno);
> if (status[0] != nodes[0])
> printf("migration failed\n");
> }
> return 0;
> }
Great! Thanks for creating the reproducer (I will also try it if I find
some time).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
2024-07-26 2:33 ` Baolin Wang
@ 2024-07-26 8:18 ` Muchun Song
2024-07-26 15:26 ` Peter Xu
2024-07-29 4:51 ` Oscar Salvador
3 siblings, 0 replies; 12+ messages in thread
From: Muchun Song @ 2024-07-26 8:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: LKML, Linux Memory Management List, Andrew Morton, Peter Xu,
Oscar Salvador, stable
> On Jul 26, 2024, at 02:39, David Hildenbrand <david@redhat.com> wrote:
>
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
>
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
>
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Thanks.
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
2024-07-26 2:33 ` Baolin Wang
2024-07-26 8:18 ` Muchun Song
@ 2024-07-26 15:26 ` Peter Xu
2024-07-26 15:32 ` David Hildenbrand
2024-07-29 4:51 ` Oscar Salvador
3 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-07-26 15:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song,
Oscar Salvador, stable
On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
>
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
>
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
Do you mean "does not support SPLIT_PMD_PTLOCK" here instead of SMP?
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Patch looks all right to me:
Reviewed-by: Peter Xu <peterx@redhat.com>
Thanks!
--
Peter Xu
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-26 15:26 ` Peter Xu
@ 2024-07-26 15:32 ` David Hildenbrand
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-26 15:32 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song,
Oscar Salvador, stable
On 26.07.24 17:26, Peter Xu wrote:
> On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code to also walk
>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>> future of having less hugetlb-specific page table walking code in the
>> codebase. Turns out that we missed one page table locking detail: page
>> table locking for hugetlb folios that are not mapped using a single
>> PMD/PUD.
>>
>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
>> core-mm page table walker would.
>>
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of both
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP.
>
> Do you mean "does not support SPLIT_PMD_PTLOCK" here instead of SMP?
Split PT locks are only enabled once we have NR_CPUS >= 4, so without
SMP (NR_CPUS == 1), no split PT locks are used. I document that in the
other series that I just sent out in a better way.
>
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Patch looks all right to me:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> Thanks!
Thanks Peter!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
` (2 preceding siblings ...)
2024-07-26 15:26 ` Peter Xu
@ 2024-07-29 4:51 ` Oscar Salvador
3 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2024-07-29 4:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Muchun Song, Peter Xu,
stable
On Thu, Jul 25, 2024 at 08:39:55PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code to also walk
> hugetlb VMAs without most hugetlb special-casing, preparing for the
> future of having less hugetlb-specific page table walking code in the
> codebase. Turns out that we missed one page table locking detail: page
> table locking for hugetlb folios that are not mapped using a single
> PMD/PUD.
>
> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> Let's make huge_pte_lockptr() effectively uses the same PT locks as any
> core-mm page table walker would.
>
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of both
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP.
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 12+ messages in thread