* [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
@ 2026-04-30  4:11 Bibo Mao

From: Bibo Mao @ 2026-04-30 4:11 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan, Baolin Wang,
    Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang
Cc: linux-mm, linux-kernel

When running "make check" on QEMU, there are error reports like this:

  BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
  BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
  BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825

The problem is that when an application exits, the rss counter is
decremented for the huge zero PMD page; it should be skipped instead.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
 mm/huge_memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..3cbea344d4a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	const bool is_device_private = folio_is_device_private(folio);
 
+	if (is_huge_zero_pmd(pmdval))
+		return;
+
 	/* Present and device private folios are rmappable. */
 	if (is_present || is_device_private)
 		folio_remove_rmap_pmd(folio, &folio->page, vma);

base-commit: 3b3bea6d4b9c162f9e555905d96b8c1da67ecd5b
--
2.39.3
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  4:28 ` Lance Yang

From: Lance Yang @ 2026-04-30 4:28 UTC (permalink / raw)
To: maobibo
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, lance.yang, linux-mm, linux-kernel

On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
> When running "make check" on QEMU, there are error reports like this:
>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825

Good catch!

> The problem is that when an application exits, the rss counter is
> decremented for the huge zero PMD page; it should be skipped instead.

Looks like the same problem[1] we discussed recently.

[1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/

> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..3cbea344d4a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	const bool is_device_private = folio_is_device_private(folio);
>
> +	if (is_huge_zero_pmd(pmdval))
> +		return;
> +

The huge zero PMD should not be returned by vm_normal_page_pmd() or
vm_normal_folio_pmd() as a normal folio. If it reaches
zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
decision ...

So I don't think we should special-case it in zap_huge_pmd_folio(). That
only avoids this RSS decrement :)

Could you please check whether the fix[2] also fixes your QEMU test?

[2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/

Thanks,
Lance
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  4:58 ` Lance Yang

From: Lance Yang @ 2026-04-30 4:58 UTC (permalink / raw)
To: maobibo
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 12:28, Lance Yang wrote:
>
> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>> When running "make check" on QEMU, there are error reports like this:
>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>
> Good catch!
>
>> The problem is that when an application exits, the rss counter is
>> decremented for the huge zero PMD page; it should be skipped instead.
>
> Looks like the same problem[1] we discussed recently.
>
> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>  mm/huge_memory.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 970e077019b7..3cbea344d4a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>  {
>>  	const bool is_device_private = folio_is_device_private(folio);
>>
>> +	if (is_huge_zero_pmd(pmdval))
>> +		return;
>> +
>
> The huge zero PMD should not be returned by vm_normal_page_pmd() or
> vm_normal_folio_pmd() as a normal folio. If it reaches
> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
> decision ...
>
> So I don't think we should special-case it in zap_huge_pmd_folio(). That
> only avoids this RSS decrement :)
>
> Could you please check whether the fix[2] also fixes your QEMU test?

In addition, like x86-32, 64-bit LoongArch selects ARCH_HAS_PTE_SPECIAL
but not ARCH_SUPPORTS_HUGE_PFNMAP. So CONFIG_ARCH_SUPPORTS_PMD_PFNMAP is
not enabled, and pmd_special() falls back to the generic stub that
always returns false.

So I guess the fix should do the trick :)

> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>
> Thanks,
> Lance
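For reference, the generic fallback Lance mentions looks roughly like this. This is a paraphrase; the exact header file and guard in the tree under discussion are assumptions based on his description:

```c
/* Paraphrased generic stub: without CONFIG_ARCH_SUPPORTS_PMD_PFNMAP,
 * a PMD can never be reported as a special (non-refcounted) mapping
 * via a hardware/software special bit. */
#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
static inline bool pmd_special(pmd_t pmd)
{
	return false;
}
#endif
```

On such architectures the normal-vs-special decision for a PMD therefore cannot come from pmd_special() and has to be made by explicit checks such as is_huge_zero_pmd(), which is why the referenced fix matters here.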
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  6:34 ` Bibo Mao

From: Bibo Mao @ 2026-04-30 6:34 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 12:28 PM, Lance Yang wrote:
>
> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>> When running "make check" on QEMU, there are error reports like this:
>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>
> Good catch!
>
>> The problem is that when an application exits, the rss counter is
>> decremented for the huge zero PMD page; it should be skipped instead.
>
> Looks like the same problem[1] we discussed recently.
>
> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>  mm/huge_memory.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 970e077019b7..3cbea344d4a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>  {
>>  	const bool is_device_private = folio_is_device_private(folio);
>>
>> +	if (is_huge_zero_pmd(pmdval))
>> +		return;
>> +
>
> The huge zero PMD should not be returned by vm_normal_page_pmd() or
> vm_normal_folio_pmd() as a normal folio. If it reaches
> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
> decision ...
>
> So I don't think we should special-case it in zap_huge_pmd_folio(). That
> only avoids this RSS decrement :)
>
> Could you please check whether the fix[2] also fixes your QEMU test?
>
> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/

Yes, I think it will solve this problem.

Only, I think there should be a TLB flush operation after
pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
tlb_remove_page_size() should be called. Is that right?

Regards
Bibo Mao

>
> Thanks,
> Lance
>
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  7:02 ` Lance Yang

From: Lance Yang @ 2026-04-30 7:02 UTC (permalink / raw)
To: maobibo
Cc: lance.yang, akpm, david, ljs, ziy, baolin.wang, Liam.Howlett,
    npache, ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>
> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>
>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>> When running "make check" on QEMU, there are error reports like this:
>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>
>> Good catch!
>>
>>> The problem is that when an application exits, the rss counter is
>>> decremented for the huge zero PMD page; it should be skipped instead.
>>
>> Looks like the same problem[1] we discussed recently.
>>
>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>
>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>> ---
>>>  mm/huge_memory.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 970e077019b7..3cbea344d4a2 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>  {
>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>
>>> +	if (is_huge_zero_pmd(pmdval))
>>> +		return;
>>> +
>>
>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>> vm_normal_folio_pmd() as a normal folio. If it reaches
>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>> decision ...
>>
>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>> only avoids this RSS decrement :)
>>
>> Could you please check whether the fix[2] also fixes your QEMU test?
>>
>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>
> Yes, I think it will solve this problem.
>
> Only, I think there should be a TLB flush operation after
> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
> tlb_remove_page_size() should be called. Is that right?

Calling tlb_remove_page_size() is not necessary there :)

zap_huge_pmd() already marks the PMD range for TLB invalidation right
after clearing the entry:

	orig_pmd = pmdp_huge_get_and_clear_full(...);
	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

The later tlb_remove_page_size() is guarded by "is_present && folio",
and is for the normal folio case after normal_or_softleaf_folio_pmd()
returns one :)

Please correct me if I missed something :D

Cheers,
Lance
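The ordering Lance describes can be put together as an annotated sketch of the zap path. This is pseudocode assembled from the snippets quoted in this thread, not the exact code of the tree under discussion; the guard names and normal_or_softleaf_folio_pmd() are taken from his description:

```c
/* Pseudocode sketch of zap_huge_pmd(), paraphrased from this thread. */
orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd, tlb->fullmm);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);   /* marks the PMD range for TLB
                                               invalidation, including for
                                               the huge zero PMD */

folio = normal_or_softleaf_folio_pmd(...);  /* special mappings such as the
                                               huge zero PMD yield no
                                               normal folio */
if (folio)
	zap_huge_pmd_folio(mm, vma, folio, ...); /* rmap teardown + rss */

if (is_present && folio)                    /* normal folios only */
	tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
```

So the TLB invalidation happens unconditionally at the top, while rss accounting and the deferred free apply only when a normal folio was found.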
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  7:05 ` Bibo Mao

From: Bibo Mao @ 2026-04-30 7:05 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 3:02 PM, Lance Yang wrote:
>
> On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>>
>> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>>
>>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>>> When running "make check" on QEMU, there are error reports like this:
>>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>>
>>> Good catch!
>>>
>>>> The problem is that when an application exits, the rss counter is
>>>> decremented for the huge zero PMD page; it should be skipped instead.
>>>
>>> Looks like the same problem[1] we discussed recently.
>>>
>>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>>
>>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>>> ---
>>>>  mm/huge_memory.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 970e077019b7..3cbea344d4a2 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>  {
>>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>>
>>>> +	if (is_huge_zero_pmd(pmdval))
>>>> +		return;
>>>> +
>>>
>>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>>> vm_normal_folio_pmd() as a normal folio. If it reaches
>>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>>> decision ...
>>>
>>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>>> only avoids this RSS decrement :)
>>>
>>> Could you please check whether the fix[2] also fixes your QEMU test?
>>>
>>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>>
>> Yes, I think it will solve this problem.
>>
>> Only, I think there should be a TLB flush operation after
>> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
>> tlb_remove_page_size() should be called. Is that right?
>
> Calling tlb_remove_page_size() is not necessary there :)
>
> zap_huge_pmd() already marks the PMD range for TLB invalidation right
> after clearing the entry:
>
> 	orig_pmd = pmdp_huge_get_and_clear_full(...);
> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

Yes, it is. I forgot the tlb_flush_pmd_range() call in
tlb_remove_pmd_tlb_entry().

So the fix solves this problem, and thanks for your explanation.

Regards
Bibo Mao

>
> The later tlb_remove_page_size() is guarded by "is_present && folio",
> and is for the normal folio case after normal_or_softleaf_folio_pmd()
> returns one :)
>
> Please correct me if I missed something :D
>
> Cheers,
> Lance
>
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  7:16 ` Lance Yang

From: Lance Yang @ 2026-04-30 7:16 UTC (permalink / raw)
To: Bibo Mao
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 15:05, Bibo Mao wrote:
>
> On 2026/4/30 3:02 PM, Lance Yang wrote:
>>
>> On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>>>
>>> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>>>
>>>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>>>> When running "make check" on QEMU, there are error reports like this:
>>>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>>>
>>>> Good catch!
>>>>
>>>>> The problem is that when an application exits, the rss counter is
>>>>> decremented for the huge zero PMD page; it should be skipped instead.
>>>>
>>>> Looks like the same problem[1] we discussed recently.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>>>
>>>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>>>> ---
>>>>>  mm/huge_memory.c | 3 +++
>>>>>  1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 970e077019b7..3cbea344d4a2 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>>  {
>>>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>>>
>>>>> +	if (is_huge_zero_pmd(pmdval))
>>>>> +		return;
>>>>> +
>>>>
>>>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>>>> vm_normal_folio_pmd() as a normal folio. If it reaches
>>>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>>>> decision ...
>>>>
>>>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>>>> only avoids this RSS decrement :)
>>>>
>>>> Could you please check whether the fix[2] also fixes your QEMU test?
>>>>
>>>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>>>
>>> Yes, I think it will solve this problem.
>>>
>>> Only, I think there should be a TLB flush operation after
>>> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
>>> tlb_remove_page_size() should be called. Is that right?
>>
>> Calling tlb_remove_page_size() is not necessary there :)
>>
>> zap_huge_pmd() already marks the PMD range for TLB invalidation right
>> after clearing the entry:
>>
>> 	orig_pmd = pmdp_huge_get_and_clear_full(...);
>> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> Yes, it is. I forgot the tlb_flush_pmd_range() call in
> tlb_remove_pmd_tlb_entry().
>
> So the fix solves this problem, and thanks for your explanation.

If possible, can you test the fix[1] with your QEMU workload and
provide a Tested-by? That would be very helpful :D

[1] https://lore.kernel.org/linux-mm/4d950326-6944-409b-b108-a4e67256857f@kernel.org/
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  8:09 ` Bibo Mao

From: Bibo Mao @ 2026-04-30 8:09 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 3:16 PM, Lance Yang wrote:
>
> On 2026/4/30 15:05, Bibo Mao wrote:
>>
>> On 2026/4/30 3:02 PM, Lance Yang wrote:
>>>
>>> On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>>>>
>>>> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>>>>
>>>>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>>>>> When running "make check" on QEMU, there are error reports like this:
>>>>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>>>>
>>>>> Good catch!
>>>>>
>>>>>> The problem is that when an application exits, the rss counter is
>>>>>> decremented for the huge zero PMD page; it should be skipped instead.
>>>>>
>>>>> Looks like the same problem[1] we discussed recently.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>>>>
>>>>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>>>>> ---
>>>>>>  mm/huge_memory.c | 3 +++
>>>>>>  1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 970e077019b7..3cbea344d4a2 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>>>  {
>>>>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>>>>
>>>>>> +	if (is_huge_zero_pmd(pmdval))
>>>>>> +		return;
>>>>>> +
>>>>>
>>>>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>>>>> vm_normal_folio_pmd() as a normal folio. If it reaches
>>>>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>>>>> decision ...
>>>>>
>>>>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>>>>> only avoids this RSS decrement :)
>>>>>
>>>>> Could you please check whether the fix[2] also fixes your QEMU test?
>>>>>
>>>>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>>>>
>>>> Yes, I think it will solve this problem.
>>>>
>>>> Only, I think there should be a TLB flush operation after
>>>> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
>>>> tlb_remove_page_size() should be called. Is that right?
>>>
>>> Calling tlb_remove_page_size() is not necessary there :)
>>>
>>> zap_huge_pmd() already marks the PMD range for TLB invalidation right
>>> after clearing the entry:
>>>
>>> 	orig_pmd = pmdp_huge_get_and_clear_full(...);
>>> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>> Yes, it is. I forgot the tlb_flush_pmd_range() call in
>> tlb_remove_pmd_tlb_entry().
>>
>> So the fix solves this problem, and thanks for your explanation.
>
> If possible, can you test the fix[1] with your QEMU workload and
> provide a Tested-by? That would be very helpful :D

Yes, this patch solves the problem. I am not subscribed to the
linux-mm@kvack.org mailing list, so please feel free to add:

Tested-by: Bibo Mao <maobibo@loongson.cn>

Regards
Bibo Mao

>
> [1] https://lore.kernel.org/linux-mm/4d950326-6944-409b-b108-a4e67256857f@kernel.org/
>
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  8:15 ` Lance Yang

From: Lance Yang @ 2026-04-30 8:15 UTC (permalink / raw)
To: Bibo Mao
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 16:09, Bibo Mao wrote:
>
> On 2026/4/30 3:16 PM, Lance Yang wrote:
>>
>> On 2026/4/30 15:05, Bibo Mao wrote:
>>>
>>> On 2026/4/30 3:02 PM, Lance Yang wrote:
>>>>
>>>> On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>>>>>
>>>>> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>>>>>
>>>>>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>>>>>> When running "make check" on QEMU, there are error reports like this:
>>>>>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>>>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>>>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>>>>>
>>>>>> Good catch!
>>>>>>
>>>>>>> The problem is that when an application exits, the rss counter is
>>>>>>> decremented for the huge zero PMD page; it should be skipped instead.
>>>>>>
>>>>>> Looks like the same problem[1] we discussed recently.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>>>>>
>>>>>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>>>>>> ---
>>>>>>>  mm/huge_memory.c | 3 +++
>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 970e077019b7..3cbea344d4a2 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>>>>  {
>>>>>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>>>>>
>>>>>>> +	if (is_huge_zero_pmd(pmdval))
>>>>>>> +		return;
>>>>>>> +
>>>>>>
>>>>>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>>>>>> vm_normal_folio_pmd() as a normal folio. If it reaches
>>>>>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>>>>>> decision ...
>>>>>>
>>>>>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>>>>>> only avoids this RSS decrement :)
>>>>>>
>>>>>> Could you please check whether the fix[2] also fixes your QEMU test?
>>>>>>
>>>>>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>>>>>
>>>>> Yes, I think it will solve this problem.
>>>>>
>>>>> Only, I think there should be a TLB flush operation after
>>>>> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
>>>>> tlb_remove_page_size() should be called. Is that right?
>>>>
>>>> Calling tlb_remove_page_size() is not necessary there :)
>>>>
>>>> zap_huge_pmd() already marks the PMD range for TLB invalidation right
>>>> after clearing the entry:
>>>>
>>>> 	orig_pmd = pmdp_huge_get_and_clear_full(...);
>>>> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>> Yes, it is. I forgot the tlb_flush_pmd_range() call in
>>> tlb_remove_pmd_tlb_entry().
>>>
>>> So the fix solves this problem, and thanks for your explanation.
>>
>> If possible, can you test the fix[1] with your QEMU workload and
>> provide a Tested-by? That would be very helpful :D
>
> Yes, this patch solves the problem. I am not subscribed to the
> linux-mm@kvack.org mailing list, so please feel free to add:
>
> Tested-by: Bibo Mao <maobibo@loongson.cn>

Thanks for testing!
* Re: [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio()
  2026-04-30  7:12 ` Lance Yang

From: Lance Yang @ 2026-04-30 7:12 UTC (permalink / raw)
To: maobibo
Cc: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
    ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On 2026/4/30 15:02, Lance Yang wrote:
>
> On Thu, Apr 30, 2026 at 02:34:20PM +0800, Bibo Mao wrote:
>>
>> On 2026/4/30 12:28 PM, Lance Yang wrote:
>>>
>>> On Thu, Apr 30, 2026 at 12:11:20PM +0800, Bibo Mao wrote:
>>>> When running "make check" on QEMU, there are error reports like this:
>>>>   BUG: Bad rss-counter state mm:00000000972846bc type:MM_FILEPAGES val:-4096 Comm:bios-tables-tes Pid:27802
>>>>   BUG: Bad rss-counter state mm:00000000752180c5 type:MM_FILEPAGES val:-2048 Comm:worker Pid:27815
>>>>   BUG: Bad rss-counter state mm:000000009c2f6a61 type:MM_FILEPAGES val:-2048 Comm:qom-test Pid:27825
>>>
>>> Good catch!
>>>
>>>> The problem is that when an application exits, the rss counter is
>>>> decremented for the huge zero PMD page; it should be skipped instead.
>>>
>>> Looks like the same problem[1] we discussed recently.
>>>
>>> [1] https://lore.kernel.org/linux-mm/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com/
>>>
>>>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>>>> ---
>>>>  mm/huge_memory.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 970e077019b7..3cbea344d4a2 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2423,6 +2423,9 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>  {
>>>>  	const bool is_device_private = folio_is_device_private(folio);
>>>>
>>>> +	if (is_huge_zero_pmd(pmdval))
>>>> +		return;
>>>> +
>>>
>>> The huge zero PMD should not be returned by vm_normal_page_pmd() or
>>> vm_normal_folio_pmd() as a normal folio. If it reaches
>>> zap_huge_pmd_folio(), we have already made the wrong normal-vs-special
>>> decision ...
>>>
>>> So I don't think we should special-case it in zap_huge_pmd_folio(). That
>>> only avoids this RSS decrement :)
>>>
>>> Could you please check whether the fix[2] also fixes your QEMU test?
>>>
>>> [2] https://lore.kernel.org/linux-mm/ea1453a6-14c9-4334-ac7e-2758586393b2@kernel.org/
>>
>> Yes, I think it will solve this problem.
>>
>> Only, I think there should be a TLB flush operation after
>> pmdp_huge_get_and_clear_full() even for the huge_zero_pmd page, so
>> tlb_remove_page_size() should be called. Is that right?
>
> Calling tlb_remove_page_size() is not necessary there :)
>
> zap_huge_pmd() already marks the PMD range for TLB invalidation right
> after clearing the entry:
>
> 	orig_pmd = pmdp_huge_get_and_clear_full(...);
> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>
> The later tlb_remove_page_size() is guarded by "is_present && folio",
> and is for the normal folio case after normal_or_softleaf_folio_pmd()
> returns one :)

Forgot to add: tlb_remove_page_size() queues the folio for freeing via
mmu_gather. The shared huge zero folio only needs PMD TLB invalidation,
not the delayed freeing :)

>
> Please correct me if I missed something :D
>
> Cheers,
> Lance
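The contrast Lance draws between the two mmu_gather helpers can be summarized in a short sketch (paraphrased from this thread; mmu_gather internals are simplified):

```c
/* Roles of the two mmu_gather helpers, paraphrased:
 *
 *   tlb_remove_pmd_tlb_entry(tlb, pmd, addr)
 *       Marks the PMD range for TLB invalidation (via
 *       tlb_flush_pmd_range()). Needed for every cleared PMD,
 *       including the shared huge zero PMD.
 *
 *   tlb_remove_page_size(tlb, page, size)
 *       Queues the page in the mmu_gather so its reference is dropped,
 *       and the page possibly freed, after the TLB flush. Only
 *       meaningful for normal refcounted folios; the global huge zero
 *       folio is never freed through this path.
 */
```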
End of thread (newest: 2026-04-30 8:15 UTC). Thread overview: 10+ messages:

2026-04-30  4:11 [PATCH] mm/huge_memory: skip huge_zero_pmd in zap_huge_pmd_folio() Bibo Mao
2026-04-30  4:28 ` Lance Yang
2026-04-30  4:58   ` Lance Yang
2026-04-30  6:34   ` Bibo Mao
2026-04-30  7:02     ` Lance Yang
2026-04-30  7:05       ` Bibo Mao
2026-04-30  7:16         ` Lance Yang
2026-04-30  8:09           ` Bibo Mao
2026-04-30  8:15             ` Lance Yang
2026-04-30  7:12       ` Lance Yang