From: Lance Yang <lance.yang@linux.dev>
To: Zi Yan <ziy@nvidia.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, usamaarif642@gmail.com,
	yuzhao@google.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org, voidice@gmail.com, Liam.Howlett@oracle.com,
	catalin.marinas@arm.com, cerasuolodomenico@gmail.com,
	hannes@cmpxchg.org, kaleshsingh@google.com, npache@redhat.com,
	riel@surriel.com, roman.gushchin@linux.dev, rppt@kernel.org,
	ryan.roberts@arm.com, dev.jain@arm.com, ryncsn@gmail.com,
	shakeel.butt@linux.dev, surenb@google.com, hughd@google.com,
	willy@infradead.org, matthew.brost@intel.com,
	joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
	gourry@gourry.net, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, qun-wei.lin@mediatek.com,
	Andrew.Yang@mediatek.com, casper.li@mediatek.com,
	chinwen.chang@mediatek.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mediatek@lists.infradead.org,
	linux-mm@kvack.org, ioworker0@gmail.com, stable@vger.kernel.org,
	linux-riscv@lists.infradead.org, palmer@rivosinc.com,
	samuel.holland@sifive.com, charlie@rivosinc.com
Subject: Re: [PATCH 1/1] mm/thp: fix MTE tag mismatch when replacing zero-filled subpages
Date: Mon, 22 Sep 2025 11:36:11 +0800
Message-ID: <e4e82695-c03f-4105-bddd-9778d7e368d4@linux.dev>
In-Reply-To: <3DD2EF5E-3E6A-40B0-AFCC-8FB38F0763DB@nvidia.com>

Cc: RISC-V folks

On 2025/9/22 10:36, Zi Yan wrote:
> On 21 Sep 2025, at 22:14, Lance Yang wrote:
> 
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When both THP and MTE are enabled, splitting a THP and replacing its
>> zero-filled subpages with the shared zeropage can cause MTE tag mismatch
>> faults in userspace.
>>
>> Remapping zero-filled subpages to the shared zeropage is unsafe, as the
>> zeropage has a fixed tag of zero, which may not match the tag expected by
>> the userspace pointer.
>>
>> KSM already avoids this problem by using memcmp_pages(), which on arm64
>> intentionally reports MTE-tagged pages as non-identical to prevent unsafe
>> merging.
>>
>> As suggested by David[1], this patch adopts the same pattern, replacing the
>> memchr_inv() byte-level check with a call to pages_identical(). This
>> leverages existing architecture-specific logic to determine if a page is
>> truly identical to the shared zeropage.
>>
>> Having both the THP shrinker and KSM rely on pages_identical() makes the
>> design more future-proof, IMO. Instead of handling quirks in generic code,
>> we just let the architecture decide what makes two pages identical.
>>
>> [1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
>>
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com>
>> Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@mediatek.com
>> Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> Tested on x86_64 and on QEMU for arm64 (with and without MTE support),
>> and the fix works as expected.
> 
> From [1], I see you mentioned RISC-V also has the address masking feature.
> Is it affected by this? And memcmp_pages() is only implemented by ARM64
> for MTE. Should any arch with address masking always implement it to avoid
> the same issue?

Yeah, I'm new to RISC-V, but it seems to have a similar feature, described
in Documentation/arch/riscv/uabi.rst: the Supm extension (pointer masking
for user mode).

```
Pointer masking
---------------

Support for pointer masking in userspace (the Supm extension) is provided via
the ``PR_SET_TAGGED_ADDR_CTRL`` and ``PR_GET_TAGGED_ADDR_CTRL`` ``prctl()``
operations. Pointer masking is disabled by default. To enable it, userspace
must call ``PR_SET_TAGGED_ADDR_CTRL`` with the ``PR_PMLEN`` field set to the
number of mask/tag bits needed by the application. ``PR_PMLEN`` is interpreted
as a lower bound; if the kernel is unable to satisfy the request, the
``PR_SET_TAGGED_ADDR_CTRL`` operation will fail. The actual number of tag bits
is returned in ``PR_PMLEN`` by the ``PR_GET_TAGGED_ADDR_CTRL`` operation.
```
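
For illustration only, a rough userspace sketch of the interface described
above. The helper names are made up, and the fallback values for the prctl
numbers and the PR_PMLEN field layout are assumptions based on my reading of
include/uapi/linux/prctl.h, so please check them against your headers:

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed from the uapi prctl.h. */
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#endif
#ifndef PR_GET_TAGGED_ADDR_CTRL
#define PR_GET_TAGGED_ADDR_CTRL 56
#endif
#ifndef PR_PMLEN_SHIFT
#define PR_PMLEN_SHIFT 24
#define PR_PMLEN_MASK (0x7fUL << PR_PMLEN_SHIFT)
#endif

/* Hypothetical helper: place the requested tag width in the PR_PMLEN field. */
static unsigned long pmlen_encode(unsigned long bits)
{
	return (bits << PR_PMLEN_SHIFT) & PR_PMLEN_MASK;
}

/* Hypothetical helper: extract the tag width from a control word. */
static unsigned long pmlen_decode(unsigned long ctrl)
{
	return (ctrl & PR_PMLEN_MASK) >> PR_PMLEN_SHIFT;
}

/*
 * Hypothetical helper: ask for at least min_bits tag bits. Returns the
 * number of tag bits actually granted, or -1 if the kernel cannot satisfy
 * the request (for example on hardware or kernels without Supm).
 */
static int enable_pointer_masking(unsigned long min_bits)
{
	int ctrl;

	if (prctl(PR_SET_TAGGED_ADDR_CTRL, pmlen_encode(min_bits), 0UL, 0UL, 0UL))
		return -1;

	ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0UL, 0UL, 0UL, 0UL);
	if (ctrl < 0)
		return -1;

	return (int)pmlen_decode((unsigned long)ctrl);
}
```

Because PR_PMLEN is a lower bound, a caller asking for 7 bits should be
prepared to get more than 7 back, and should treat -1 as "no pointer
masking here" rather than as a fatal error.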

But, IIUC, Supm by itself only ensures that the upper bits are ignored on
memory access :)

So, RISC-V today would likely not be affected. However, once it implements
full hardware tag checking, it will face the exact same zero-page problem.
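
To make the distinction concrete, here is a toy userspace model, not kernel
code. It assumes a 4-bit tag in pointer bits 59:56 as with MTE, and keeps a
single tag per page instead of one per granule; all names are invented for
the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT 56		/* assumption: tag in pointer bits 59:56 */
#define TAG_MASK  0xfULL

/* Toy stand-in for a page; real hardware tags smaller granules. */
struct toy_mem {
	uint8_t mem_tag;
};

static uint64_t tag_pointer(uint64_t addr, uint8_t tag)
{
	return (addr & ~(TAG_MASK << TAG_SHIFT)) |
	       ((tag & TAG_MASK) << TAG_SHIFT);
}

static uint8_t ptr_tag(uint64_t ptr)
{
	return (ptr >> TAG_SHIFT) & TAG_MASK;
}

/*
 * Pointer masking only: the tag bits are dropped before the access and
 * nothing compares them against memory, so the access always goes through.
 */
static uint64_t masked_address(uint64_t ptr)
{
	return ptr & ~(TAG_MASK << TAG_SHIFT);
}

/* Tag checking: the pointer's tag must match the tag stored for the memory. */
static bool tag_check_passes(uint64_t ptr, const struct toy_mem *m)
{
	return ptr_tag(ptr) == m->mem_tag;
}
```

With a pointer tagged 5, tag_check_passes() succeeds against memory tagged
5 but fails against a zeropage stand-in whose tag is fixed at 0, while
masked_address() yields the same address either way. That failing check is
the fault the patch avoids on arm64.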

Anyway, any architecture with a feature like MTE in the future will need
its own memcmp_pages() to prevent unsafe merges ;)
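
As a sketch of what such an override has to do, here is a userspace model
in the style of the arm64 behaviour described in the commit message. It is
not the kernel implementation; the struct, the "tagged" flag and the
function names are all stand-ins:

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool tagged;	/* stands in for the arch's "page carries tags" state */
};

/* Models the generic fallback: a plain byte comparison. */
static int generic_memcmp_pages(const struct toy_page *a,
				const struct toy_page *b)
{
	return memcmp(a->data, b->data, TOY_PAGE_SIZE);
}

/*
 * Models an arch override: equal bytes are not enough. If either page is
 * tagged, report a difference unless both arguments are the same page.
 */
static int arch_memcmp_pages(const struct toy_page *a,
			     const struct toy_page *b)
{
	int ret = generic_memcmp_pages(a, b);

	if (ret)
		return ret;
	if (a->tagged || b->tagged)
		return a != b;
	return 0;
}

/* Models pages_identical(): true only if the arch comparison says so. */
static bool toy_pages_identical(const struct toy_page *a,
				const struct toy_page *b)
{
	return !arch_memcmp_pages(a, b);
}
```

A zero-filled but tagged subpage compares equal to the zeropage under
generic_memcmp_pages(), yet toy_pages_identical() rejects it, so callers
that go through the arch hook never remap it.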

> 
>>
>>   mm/huge_memory.c | 15 +++------------
>>   mm/migrate.c     |  8 +-------
>>   2 files changed, 4 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 32e0ec2dde36..28d4b02a1aa5 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4104,29 +4104,20 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>   static bool thp_underused(struct folio *folio)
>>   {
>>   	int num_zero_pages = 0, num_filled_pages = 0;
>> -	void *kaddr;
>>   	int i;
>>
>>   	for (i = 0; i < folio_nr_pages(folio); i++) {
>> -		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>> -		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>> -			num_zero_pages++;
>> -			if (num_zero_pages > khugepaged_max_ptes_none) {
>> -				kunmap_local(kaddr);
>> +		if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
>> +			if (++num_zero_pages > khugepaged_max_ptes_none)
>>   				return true;
>> -			}
>>   		} else {
>>   			/*
>>   			 * Another path for early exit once the number
>>   			 * of non-zero filled pages exceeds threshold.
>>   			 */
>> -			num_filled_pages++;
>> -			if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
>> -				kunmap_local(kaddr);
>> +			if (++num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none)
>>   				return false;
>> -			}
>>   		}
>> -		kunmap_local(kaddr);
>>   	}
>>   	return false;
>>   }
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index aee61a980374..ce83c2c3c287 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -300,9 +300,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>   					  unsigned long idx)
>>   {
>>   	struct page *page = folio_page(folio, idx);
>> -	bool contains_data;
>>   	pte_t newpte;
>> -	void *addr;
>>
>>   	if (PageCompound(page))
>>   		return false;
>> @@ -319,11 +317,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>   	 * this subpage has been non present. If the subpage is only zero-filled
>>   	 * then map it to the shared zeropage.
>>   	 */
>> -	addr = kmap_local_page(page);
>> -	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>> -	kunmap_local(addr);
>> -
>> -	if (contains_data)
>> +	if (!pages_identical(page, ZERO_PAGE(0)))
>>   		return false;
>>
>>   	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>> -- 
>> 2.49.0
> 
> The changes look good to me. Thanks. Acked-by: Zi Yan <ziy@nvidia.com>

Cheers!



