* [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
@ 2025-06-11  7:46 Jinjiang Tu
  2025-06-11  7:59 ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-11  7:46 UTC (permalink / raw)
  To: akpm, linmiaohe, david; +Cc: linux-mm, wangkefeng.wang, tujinjiang

In shrink_folio_list(), the hwpoisoned folio may be a large folio, which
can't be handled by unmap_poisoned_folio().

Since a UCE is rare in the real world, and a race with reclaim is even
rarer, just skipping the hwpoisoned large folio is enough. memory_failure()
will handle it if the UCE is triggered again.

Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6f4db6c240f..3a4e8d7419ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			goto keep;
 
 		if (folio_contain_hwpoisoned_page(folio)) {
+			/*
+			 * unmap_poisoned_folio() can't handle large
+			 * folio, just skip it. memory_failure() will
+			 * handle it if the UCE is triggered again.
+			 */
+			if (folio_test_large(folio))
+				goto keep_locked;
+
 			unmap_poisoned_folio(folio, folio_pfn(folio), false);
 			folio_unlock(folio);
 			folio_put(folio);
-- 
2.43.0




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  7:46 [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
@ 2025-06-11  7:59 ` David Hildenbrand
  2025-06-11  8:29   ` Jinjiang Tu
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-11  7:59 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang

On 11.06.25 09:46, Jinjiang Tu wrote:
> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> Since UCE is rare in real world, and race with reclaimation is more rare,
> just skipping the hwpoisoned large folio is enough. memory_failure() will
> handle it if the UCE is triggered again.
> 
> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")

Please also add

Closes:

with a link to the report

> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>   mm/vmscan.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6f4db6c240f..3a4e8d7419ae 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			goto keep;
>   
>   		if (folio_contain_hwpoisoned_page(folio)) {
> +			/*
> +			 * unmap_poisoned_folio() can't handle large
> +			 * folio, just skip it. memory_failure() will
> +			 * handle it if the UCE is triggered again.
> +			 */
> +			if (folio_test_large(folio))
> +				goto keep_locked;
> +
>   			unmap_poisoned_folio(folio, folio_pfn(folio), false);
>   			folio_unlock(folio);
>   			folio_put(folio);

Why not handle that in unmap_poisoned_folio() to make that limitation
clear and avoid it there?

Essentially, checking for folio_test_large() && !folio_test_hugetlb() in 
unmap_poisoned_folio().

I'm surprised that try_to_unmap() doesn't seem to be able to handle 
PMD-mapped file pages, probably easy to add once there is demand for it.

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  7:59 ` David Hildenbrand
@ 2025-06-11  8:29   ` Jinjiang Tu
  2025-06-11  8:35     ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-11  8:29 UTC (permalink / raw)
  To: David Hildenbrand, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang


On 2025/6/11 15:59, David Hildenbrand wrote:
> On 11.06.25 09:46, Jinjiang Tu wrote:
>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>> can't be handled by unmap_poisoned_folio().
>>
>> Since UCE is rare in real world, and race with reclaimation is more 
>> rare,
>> just skipping the hwpoisoned large folio is enough. memory_failure() 
>> will
>> handle it if the UCE is triggered again.
>>
>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>
> Please also add
>
> Closes:
>
> with a link to the report
Thanks, I will add it.
>
>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>> ---
>>   mm/vmscan.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b6f4db6c240f..3a4e8d7419ae 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct 
>> list_head *folio_list,
>>               goto keep;
>>             if (folio_contain_hwpoisoned_page(folio)) {
>> +            /*
>> +             * unmap_poisoned_folio() can't handle large
>> +             * folio, just skip it. memory_failure() will
>> +             * handle it if the UCE is triggered again.
>> +             */
>> +            if (folio_test_large(folio))
>> +                goto keep_locked;
>> +
>>               unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>               folio_unlock(folio);
>>               folio_put(folio);
>
> Why not handle that in unmap_poisoned_folio() to make that limitation 
> clear and avoid?
I tried to put the check in unmap_poisoned_folio(), but other issues
still exist.
The calltrace in v6.6 kernel:

Unable to handle kernel paging request at virtual address fbd5200000000024
KASAN: maybe wild-memory-access in range 
[0xdead000000000120-0xdead000000000127]
pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
lr : __list_add_valid include/linux/list.h:88 [inline]
lr : __list_add include/linux/list.h:150 [inline]
lr : list_add_tail include/linux/list.h:183 [inline]
lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
Call trace:
  __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
  __list_add_valid include/linux/list.h:88 [inline]
  __list_add include/linux/list.h:150 [inline]
  list_add_tail include/linux/list.h:183 [inline]
  lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
  __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
  __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
  split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
  split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
  split_folio_to_order include/linux/huge_mm.h:643 [inline]
  deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
  do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
  shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
  shrink_slab+0x228/0x250 mm/vmscan.c:1075
  shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
  shrink_node+0x21c/0x1378 mm/vmscan.c:6664
  shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
  do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968


The folio is deleted from the LRU and folio->lru can't be accessed. If
the folio is split later,
lru_add_split_folio() assumes the folio is on the LRU.
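
For reference, the branch in question looks roughly like this (my paraphrase
of lru_add_page_tail() in v6.6 / lru_add_split_folio() in current kernels,
not verbatim; details differ between versions):

	/* Paraphrased sketch, not the exact kernel code. */
	static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
			struct lruvec *lruvec, struct list_head *list)
	{
		VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);

		if (list) {
			/* Caller isolated the folio (e.g. page reclaim):
			 * collect the after-split folios on the caller's list. */
			VM_WARN_ON(folio_test_lru(folio));
			list_add_tail(&new_folio->lru, list);
		} else {
			/* No list given: the folio is expected to still be
			 * on the LRU, so queue the new folio next to it. */
			VM_WARN_ON(!folio_test_lru(folio));
			list_add_tail(&new_folio->lru, &folio->lru);
			folio_set_lru(new_folio);
		}
	}
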
>
> Essentially, checking for folio_test_large() && !folio_test_hugetlb() 
> in unmap_poisoned_folio().
>
> I'm surprised that try_to_unmap() doesn't seem to be able to handle 
> PMD-mapped file pages, probably easy to add once there is demand for it.
If the entire PMD entry is converted to a hwpoison entry, then a process
that accesses a base page of the THP which doesn't have the UCE will
also be killed,
which is unnecessary.

Thanks.




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  8:29   ` Jinjiang Tu
@ 2025-06-11  8:35     ` David Hildenbrand
  2025-06-11  9:00       ` Jinjiang Tu
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-11  8:35 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang

On 11.06.25 10:29, Jinjiang Tu wrote:
> 
> On 2025/6/11 15:59, David Hildenbrand wrote:
>> On 11.06.25 09:46, Jinjiang Tu wrote:
>>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>>> can't be handled by unmap_poisoned_folio().
>>>
>>> Since UCE is rare in real world, and race with reclaimation is more
>>> rare,
>>> just skipping the hwpoisoned large folio is enough. memory_failure()
>>> will
>>> handle it if the UCE is triggered again.
>>>
>>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>>
>> Please also add
>>
>> Closes:
>>
>> with a link to the report
> Thanks, I will add it.
>>
>>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ---
>>>    mm/vmscan.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index b6f4db6c240f..3a4e8d7419ae 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                goto keep;
>>>              if (folio_contain_hwpoisoned_page(folio)) {
>>> +            /*
>>> +             * unmap_poisoned_folio() can't handle large
>>> +             * folio, just skip it. memory_failure() will
>>> +             * handle it if the UCE is triggered again.
>>> +             */
>>> +            if (folio_test_large(folio))
>>> +                goto keep_locked;
>>> +
>>>                unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>>                folio_unlock(folio);
>>>                folio_put(folio);
>>
>> Why not handle that in unmap_poisoned_folio() to make that limitation
>> clear and avoid?
> I tried to put the check in unmap_poisoned_folio(), but it still exists
> other issues.



> The calltrace in v6.6 kernel:
> 
> Unable to handle kernel paging request at virtual address fbd5200000000024
> KASAN: maybe wild-memory-access in range
> [0xdead000000000120-0xdead000000000127]
> pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
> lr : __list_add_valid include/linux/list.h:88 [inline]
> lr : __list_add include/linux/list.h:150 [inline]
> lr : list_add_tail include/linux/list.h:183 [inline]
> lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
> Call trace:
>    __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>    __list_add_valid include/linux/list.h:88 [inline]
>    __list_add include/linux/list.h:150 [inline]
>    list_add_tail include/linux/list.h:183 [inline]
>    lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>    __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
>    __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
>    split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
>    split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
>    split_folio_to_order include/linux/huge_mm.h:643 [inline]
>    deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
>    do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
>    shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
>    shrink_slab+0x228/0x250 mm/vmscan.c:1075
>    shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
>    shrink_node+0x21c/0x1378 mm/vmscan.c:6664
>    shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
>    do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968
> 
> 
> The folio is deleted from lru and the folio->lru can't be accessed. If
> the folio is splitted later,
> lru_add_split_folio() assumes the folio is on lru.

Not sure if something like the following would be appropriate:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b91a33fb6c694..fdd58c8ba5254 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1566,6 +1566,9 @@ int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
         enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
         struct address_space *mapping;
  
+       if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+               return -EBUSY;
+
         if (folio_test_swapcache(folio)) {
                 pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
                 ttu &= ~TTU_HWPOISON;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f8dfd2864bbf4..6a3426bc9e9d7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1138,7 +1138,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                         goto keep;
  
                 if (folio_contain_hwpoisoned_page(folio)) {
-                       unmap_poisoned_folio(folio, folio_pfn(folio), false);
+                       if (unmap_poisoned_folio(folio, folio_pfn(folio), false)){
+                               list_add(&folio->lru, &ret_folios);
                         folio_unlock(folio);
                         folio_put(folio);
                         continue;

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  8:35     ` David Hildenbrand
@ 2025-06-11  9:00       ` Jinjiang Tu
  2025-06-11  9:20         ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-11  9:00 UTC (permalink / raw)
  To: David Hildenbrand, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang


On 2025/6/11 16:35, David Hildenbrand wrote:
> On 11.06.25 10:29, Jinjiang Tu wrote:
>>
>> On 2025/6/11 15:59, David Hildenbrand wrote:
>>> On 11.06.25 09:46, Jinjiang Tu wrote:
>>>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>>>> can't be handled by unmap_poisoned_folio().
>>>>
>>>> Since UCE is rare in real world, and race with reclaimation is more
>>>> rare,
>>>> just skipping the hwpoisoned large folio is enough. memory_failure()
>>>> will
>>>> handle it if the UCE is triggered again.
>>>>
>>>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>>>
>>> Please also add
>>>
>>> Closes:
>>>
>>> with a link to the report
>> Thanks, I will add it.
>>>
>>>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>> ---
>>>>    mm/vmscan.c | 8 ++++++++
>>>>    1 file changed, 8 insertions(+)
>>>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index b6f4db6c240f..3a4e8d7419ae 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct
>>>> list_head *folio_list,
>>>>                goto keep;
>>>>              if (folio_contain_hwpoisoned_page(folio)) {
>>>> +            /*
>>>> +             * unmap_poisoned_folio() can't handle large
>>>> +             * folio, just skip it. memory_failure() will
>>>> +             * handle it if the UCE is triggered again.
>>>> +             */
>>>> +            if (folio_test_large(folio))
>>>> +                goto keep_locked;
>>>> +
>>>>                unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>>>                folio_unlock(folio);
>>>>                folio_put(folio);
>>>
>>> Why not handle that in unmap_poisoned_folio() to make that limitation
>>> clear and avoid?
>> I tried to put the check in unmap_poisoned_folio(), but it still exists
>> other issues.
>
>
>
>> The calltrace in v6.6 kernel:
>>
>> Unable to handle kernel paging request at virtual address 
>> fbd5200000000024
>> KASAN: maybe wild-memory-access in range
>> [0xdead000000000120-0xdead000000000127]
>> pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>> lr : __list_add_valid include/linux/list.h:88 [inline]
>> lr : __list_add include/linux/list.h:150 [inline]
>> lr : list_add_tail include/linux/list.h:183 [inline]
>> lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>> Call trace:
>>    __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>    __list_add_valid include/linux/list.h:88 [inline]
>>    __list_add include/linux/list.h:150 [inline]
>>    list_add_tail include/linux/list.h:183 [inline]
>>    lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>    __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
>>    __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
>>    split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
>>    split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
>>    split_folio_to_order include/linux/huge_mm.h:643 [inline]
>>    deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
>>    do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
>>    shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
>>    shrink_slab+0x228/0x250 mm/vmscan.c:1075
>>    shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
>>    shrink_node+0x21c/0x1378 mm/vmscan.c:6664
>>    shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
>>    do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968
>>
>>
>> The folio is deleted from lru and the folio->lru can't be accessed. If
>> the folio is splitted later,
>> lru_add_split_folio() assumes the folio is on lru.
>
> Not sure if something like the following would be appropriate:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b91a33fb6c694..fdd58c8ba5254 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1566,6 +1566,9 @@ int unmap_poisoned_folio(struct folio *folio, 
> unsigned long pfn, bool must_kill)
>         enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>         struct address_space *mapping;
>
> +       if (folio_test_large && !folio_test_hugetlb(folio))
> +               return -EBUSY;
> +
>         if (folio_test_swapcache(folio)) {
>                 pr_err("%#lx: keeping poisoned page in swap cache\n", 
> pfn);
>                 ttu &= ~TTU_HWPOISON;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f8dfd2864bbf4..6a3426bc9e9d7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1138,7 +1138,8 @@ static unsigned int shrink_folio_list(struct 
> list_head *folio_list,
>                         goto keep;
>
>                 if (folio_contain_hwpoisoned_page(folio)) {
> -                       unmap_poisoned_folio(folio, folio_pfn(folio), 
> false);
> +                       if (unmap_poisoned_folio(folio, 
> folio_pfn(folio), false)){
> +                               list_add(&folio->lru, &ret_folios);
>                         folio_unlock(folio);
>                         folio_put(folio);
>                         continue;

The expected behaviour is to keep the folio on the LRU if
unmap_poisoned_folio() fails?

If so, we should:

+                       if (unmap_poisoned_folio(folio, folio_pfn(folio), false)) {
+                               goto keep_locked;

otherwise, folio_put() is called twice to put the ref grabbed during isolation.
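
Putting the two together, the hunk in shrink_folio_list() might end up
looking roughly like this (just a sketch of the combined idea, not a
tested patch):

		if (folio_contain_hwpoisoned_page(folio)) {
			/*
			 * unmap_poisoned_folio() would refuse large
			 * (non-hugetlb) folios; keep the folio on the LRU
			 * and let memory_failure() deal with it if the
			 * UCE is triggered again.
			 */
			if (unmap_poisoned_folio(folio, folio_pfn(folio), false))
				goto keep_locked;
			folio_unlock(folio);
			folio_put(folio);
			continue;
		}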




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  9:00       ` Jinjiang Tu
@ 2025-06-11  9:20         ` David Hildenbrand
  2025-06-11  9:24           ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-11  9:20 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang, Zi Yan

On 11.06.25 11:00, Jinjiang Tu wrote:
> 
> On 2025/6/11 16:35, David Hildenbrand wrote:
>> On 11.06.25 10:29, Jinjiang Tu wrote:
>>>
>>> On 2025/6/11 15:59, David Hildenbrand wrote:
>>>> On 11.06.25 09:46, Jinjiang Tu wrote:
>>>>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>>>>> can't be handled by unmap_poisoned_folio().
>>>>>
>>>>> Since UCE is rare in real world, and race with reclaimation is more
>>>>> rare,
>>>>> just skipping the hwpoisoned large folio is enough. memory_failure()
>>>>> will
>>>>> handle it if the UCE is triggered again.
>>>>>
>>>>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>>>>
>>>> Please also add
>>>>
>>>> Closes:
>>>>
>>>> with a link to the report
>>> Thanks, I will add it.
>>>>
>>>>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>> ---
>>>>>     mm/vmscan.c | 8 ++++++++
>>>>>     1 file changed, 8 insertions(+)
>>>>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index b6f4db6c240f..3a4e8d7419ae 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct
>>>>> list_head *folio_list,
>>>>>                 goto keep;
>>>>>               if (folio_contain_hwpoisoned_page(folio)) {
>>>>> +            /*
>>>>> +             * unmap_poisoned_folio() can't handle large
>>>>> +             * folio, just skip it. memory_failure() will
>>>>> +             * handle it if the UCE is triggered again.
>>>>> +             */
>>>>> +            if (folio_test_large(folio))
>>>>> +                goto keep_locked;
>>>>> +
>>>>>                 unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>>>>                 folio_unlock(folio);
>>>>>                 folio_put(folio);
>>>>
>>>> Why not handle that in unmap_poisoned_folio() to make that limitation
>>>> clear and avoid?
>>> I tried to put the check in unmap_poisoned_folio(), but it still exists
>>> other issues.
>>
>>
>>
>>> The calltrace in v6.6 kernel:
>>>
>>> Unable to handle kernel paging request at virtual address
>>> fbd5200000000024
>>> KASAN: maybe wild-memory-access in range
>>> [0xdead000000000120-0xdead000000000127]
>>> pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>> lr : __list_add_valid include/linux/list.h:88 [inline]
>>> lr : __list_add include/linux/list.h:150 [inline]
>>> lr : list_add_tail include/linux/list.h:183 [inline]
>>> lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>> Call trace:
>>>     __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>>     __list_add_valid include/linux/list.h:88 [inline]
>>>     __list_add include/linux/list.h:150 [inline]
>>>     list_add_tail include/linux/list.h:183 [inline]
>>>     lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>>     __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
>>>     __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
>>>     split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
>>>     split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
>>>     split_folio_to_order include/linux/huge_mm.h:643 [inline]
>>>     deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
>>>     do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
>>>     shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
>>>     shrink_slab+0x228/0x250 mm/vmscan.c:1075
>>>     shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
>>>     shrink_node+0x21c/0x1378 mm/vmscan.c:6664
>>>     shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
>>>     do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968
>>>
>>>
>>> The folio is deleted from lru and the folio->lru can't be accessed. If
>>> the folio is splitted later,
>>> lru_add_split_folio() assumes the folio is on lru.
>>
>> Not sure if something like the following would be appropriate:
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index b91a33fb6c694..fdd58c8ba5254 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1566,6 +1566,9 @@ int unmap_poisoned_folio(struct folio *folio,
>> unsigned long pfn, bool must_kill)
>>          enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>>          struct address_space *mapping;
>>
>> +       if (folio_test_large && !folio_test_hugetlb(folio))
>> +               return -EBUSY;
>> +
>>          if (folio_test_swapcache(folio)) {
>>                  pr_err("%#lx: keeping poisoned page in swap cache\n",
>> pfn);
>>                  ttu &= ~TTU_HWPOISON;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f8dfd2864bbf4..6a3426bc9e9d7 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1138,7 +1138,8 @@ static unsigned int shrink_folio_list(struct
>> list_head *folio_list,
>>                          goto keep;
>>
>>                  if (folio_contain_hwpoisoned_page(folio)) {
>> -                       unmap_poisoned_folio(folio, folio_pfn(folio),
>> false);
>> +                       if (unmap_poisoned_folio(folio,
>> folio_pfn(folio), false)){
>> +                               list_add(&folio->lru, &ret_folios);
>>                          folio_unlock(folio);
>>                          folio_put(folio);
>>                          continue;
> 
> The expected behaviour is keeping the folio on lru if
> unmap_poisoned_folio fails?

Good question, it's a mess.

If we keep the LRU bit cleared (kept isolated), we wouldn't have to add 
it to the list.

But now I wonder where deferred_split_scan() would check for the LRU flag?

It seems to trylock, to then call split_folio().

In __folio_split(), I don't find any checks for the lru flag ... :(

We call lru_add_split_folio() where we 
VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);


So now I am confused.

> 
> If so, we should:
> 
> +                       if (unmap_poisoned_folio(folio,
> folio_pfn(folio), false)){
> +                               goto keep_locked;
> 
> otherwise, folio_put() is called twice to put ref grabbed from isolation.

Good point.

Maybe Zi Yan can help.

Re-adding them to the LRU if anything goes wrong should work, but I am 
not sure if that's the right approach.

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  9:20         ` David Hildenbrand
@ 2025-06-11  9:24           ` David Hildenbrand
  2025-06-11 14:30             ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-11  9:24 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe; +Cc: linux-mm, wangkefeng.wang, Zi Yan

On 11.06.25 11:20, David Hildenbrand wrote:
> On 11.06.25 11:00, Jinjiang Tu wrote:
>>
>> On 2025/6/11 16:35, David Hildenbrand wrote:
>>> On 11.06.25 10:29, Jinjiang Tu wrote:
>>>>
>>>> On 2025/6/11 15:59, David Hildenbrand wrote:
>>>>> On 11.06.25 09:46, Jinjiang Tu wrote:
>>>>>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>
>>>>>> Since UCE is rare in real world, and race with reclaimation is more
>>>>>> rare,
>>>>>> just skipping the hwpoisoned large folio is enough. memory_failure()
>>>>>> will
>>>>>> handle it if the UCE is triggered again.
>>>>>>
>>>>>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>>>>>
>>>>> Please also add
>>>>>
>>>>> Closes:
>>>>>
>>>>> with a link to the report
>>>> Thanks, I will add it.
>>>>>
>>>>>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>> ---
>>>>>>      mm/vmscan.c | 8 ++++++++
>>>>>>      1 file changed, 8 insertions(+)
>>>>>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index b6f4db6c240f..3a4e8d7419ae 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct
>>>>>> list_head *folio_list,
>>>>>>                  goto keep;
>>>>>>                if (folio_contain_hwpoisoned_page(folio)) {
>>>>>> +            /*
>>>>>> +             * unmap_poisoned_folio() can't handle large
>>>>>> +             * folio, just skip it. memory_failure() will
>>>>>> +             * handle it if the UCE is triggered again.
>>>>>> +             */
>>>>>> +            if (folio_test_large(folio))
>>>>>> +                goto keep_locked;
>>>>>> +
>>>>>>                  unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>>>>>                  folio_unlock(folio);
>>>>>>                  folio_put(folio);
>>>>>
>>>>> Why not handle that in unmap_poisoned_folio() to make that limitation
>>>>> clear and avoid?
>>>> I tried to put the check in unmap_poisoned_folio(), but it still exists
>>>> other issues.
>>>
>>>
>>>
>>>> The calltrace in v6.6 kernel:
>>>>
>>>> Unable to handle kernel paging request at virtual address
>>>> fbd5200000000024
>>>> KASAN: maybe wild-memory-access in range
>>>> [0xdead000000000120-0xdead000000000127]
>>>> pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>>> lr : __list_add_valid include/linux/list.h:88 [inline]
>>>> lr : __list_add include/linux/list.h:150 [inline]
>>>> lr : list_add_tail include/linux/list.h:183 [inline]
>>>> lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>>> Call trace:
>>>>      __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>>>      __list_add_valid include/linux/list.h:88 [inline]
>>>>      __list_add include/linux/list.h:150 [inline]
>>>>      list_add_tail include/linux/list.h:183 [inline]
>>>>      lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>>>      __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
>>>>      __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
>>>>      split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
>>>>      split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
>>>>      split_folio_to_order include/linux/huge_mm.h:643 [inline]
>>>>      deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
>>>>      do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
>>>>      shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
>>>>      shrink_slab+0x228/0x250 mm/vmscan.c:1075
>>>>      shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
>>>>      shrink_node+0x21c/0x1378 mm/vmscan.c:6664
>>>>      shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
>>>>      do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968
>>>>
>>>>
>>>> The folio is deleted from lru and the folio->lru can't be accessed. If
>>>> the folio is splitted later,
>>>> lru_add_split_folio() assumes the folio is on lru.
>>>
>>> Not sure if something like the following would be appropriate:
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index b91a33fb6c694..fdd58c8ba5254 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1566,6 +1566,9 @@ int unmap_poisoned_folio(struct folio *folio,
>>> unsigned long pfn, bool must_kill)
>>>           enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>>>           struct address_space *mapping;
>>>
>>> +       if (folio_test_large && !folio_test_hugetlb(folio))
>>> +               return -EBUSY;
>>> +
>>>           if (folio_test_swapcache(folio)) {
>>>                   pr_err("%#lx: keeping poisoned page in swap cache\n",
>>> pfn);
>>>                   ttu &= ~TTU_HWPOISON;
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f8dfd2864bbf4..6a3426bc9e9d7 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1138,7 +1138,8 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                           goto keep;
>>>
>>>                   if (folio_contain_hwpoisoned_page(folio)) {
>>> -                       unmap_poisoned_folio(folio, folio_pfn(folio),
>>> false);
>>> +                       if (unmap_poisoned_folio(folio,
>>> folio_pfn(folio), false)){
>>> +                               list_add(&folio->lru, &ret_folios);
>>>                           folio_unlock(folio);
>>>                           folio_put(folio);
>>>                           continue;
>>
>> The expected behaviour is keeping the folio on lru if
>> unmap_poisoned_folio fails?
> 
> Good question, it's a mess.
> 
> If we keep the LRU bit cleared (kept isolated), we wouldn't have to add
> it to the list.
> 
> But now I wonder where deferred_split_scan() would check for the LRU flag?
> 
> It seems to trylock, to then call split_folio().
> 
> In __folio_split(), I don't find any checks for the lru flag ... :(
> 
> We call lru_add_split_folio() where we
> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);

Oh, that's for the new folio. So the checks in the other two paths apply.

When we come through deferred, we don't expect a "list" and consequently 
assume that it is still on the LRU.

	VM_WARN_ON(!folio_test_lru(folio));

So not adding it back to the LRU will be problematic for 
lru_add_split_folio() when called through deferred shrinking I guess ...
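
For reference, the deferred path calls split_folio() without a destination
list, roughly like this (paraphrased from deferred_split_scan(); details
vary between kernel versions):

	/* Paraphrased sketch, not verbatim: the folio is only trylocked
	 * here, never isolated from the LRU, and split_folio() is called
	 * with list == NULL, so lru_add_split_folio() expects
	 * folio_test_lru(folio) to hold. */
	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (!folio_trylock(folio))
			goto next;
		if (!split_folio(folio))
			split++;
		folio_unlock(folio);
next:
		folio_put(folio);
	}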

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11  9:24           ` David Hildenbrand
@ 2025-06-11 14:30             ` Zi Yan
  2025-06-11 17:34               ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2025-06-11 14:30 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Jinjiang Tu, akpm, linmiaohe, linux-mm, wangkefeng.wang

On 11 Jun 2025, at 5:24, David Hildenbrand wrote:

> On 11.06.25 11:20, David Hildenbrand wrote:
>> On 11.06.25 11:00, Jinjiang Tu wrote:
>>>
>>> On 2025/6/11 16:35, David Hildenbrand wrote:
>>>> On 11.06.25 10:29, Jinjiang Tu wrote:
>>>>>
>>>>> On 2025/6/11 15:59, David Hildenbrand wrote:
>>>>>> On 11.06.25 09:46, Jinjiang Tu wrote:
>>>>>>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>>
>>>>>>> Since UCE is rare in real world, and race with reclaimation is more
>>>>>>> rare,
>>>>>>> just skipping the hwpoisoned large folio is enough. memory_failure()
>>>>>>> will
>>>>>>> handle it if the UCE is triggered again.
>>>>>>>
>>>>>>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>>>>>>
>>>>>> Please also add
>>>>>>
>>>>>> Closes:
>>>>>>
>>>>>> with a link to the report
>>>>> Thanks, I will add it.
>>>>>>
>>>>>>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>>> ---
>>>>>>>      mm/vmscan.c | 8 ++++++++
>>>>>>>      1 file changed, 8 insertions(+)
>>>>>>> /home/tujinjiang/hulk-repo/hulk/mm/mempolicy.c
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index b6f4db6c240f..3a4e8d7419ae 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1131,6 +1131,14 @@ static unsigned int shrink_folio_list(struct
>>>>>>> list_head *folio_list,
>>>>>>>                  goto keep;
>>>>>>>                if (folio_contain_hwpoisoned_page(folio)) {
>>>>>>> +            /*
>>>>>>> +             * unmap_poisoned_folio() can't handle large
>>>>>>> +             * folio, just skip it. memory_failure() will
>>>>>>> +             * handle it if the UCE is triggered again.
>>>>>>> +             */
>>>>>>> +            if (folio_test_large(folio))
>>>>>>> +                goto keep_locked;
>>>>>>> +
>>>>>>>                  unmap_poisoned_folio(folio, folio_pfn(folio), false);
>>>>>>>                  folio_unlock(folio);
>>>>>>>                  folio_put(folio);
>>>>>>
>>>>>> Why not handle that in unmap_poisoned_folio() to make that limitation
>>>>>> clear and avoid?
>>>>> I tried to put the check in unmap_poisoned_folio(), but it still exists
>>>>> other issues.
>>>>
>>>>
>>>>
>>>>> The calltrace in v6.6 kernel:
>>>>>
>>>>> Unable to handle kernel paging request at virtual address
>>>>> fbd5200000000024
>>>>> KASAN: maybe wild-memory-access in range
>>>>> [0xdead000000000120-0xdead000000000127]
>>>>> pc : __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>>>> lr : __list_add_valid include/linux/list.h:88 [inline]
>>>>> lr : __list_add include/linux/list.h:150 [inline]
>>>>> lr : list_add_tail include/linux/list.h:183 [inline]
>>>>> lr : lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>>>> Call trace:
>>>>>      __list_add_valid_or_report+0x50/0x158 lib/list_debug.c:32
>>>>>      __list_add_valid include/linux/list.h:88 [inline]
>>>>>      __list_add include/linux/list.h:150 [inline]
>>>>>      list_add_tail include/linux/list.h:183 [inline]
>>>>>      lru_add_page_tail.constprop.0+0x4ac/0x640 mm/huge_memory.c:3187
>>>>>      __split_huge_page_tail.isra.0+0x344/0x508 mm/huge_memory.c:3286
>>>>>      __split_huge_page+0x244/0x1270 mm/huge_memory.c:3317
>>>>>      split_huge_page_to_list_to_order+0x1038/0x1620 mm/huge_memory.c:3625
>>>>>      split_folio_to_list_to_order include/linux/huge_mm.h:638 [inline]
>>>>>      split_folio_to_order include/linux/huge_mm.h:643 [inline]
>>>>>      deferred_split_scan+0x5f8/0xb70 mm/huge_memory.c:3778
>>>>>      do_shrink_slab+0x2a0/0x828 mm/vmscan.c:927
>>>>>      shrink_slab_memcg+0x2c0/0x558 mm/vmscan.c:996
>>>>>      shrink_slab+0x228/0x250 mm/vmscan.c:1075
>>>>>      shrink_node_memcgs+0x34c/0x6a0 mm/vmscan.c:6630
>>>>>      shrink_node+0x21c/0x1378 mm/vmscan.c:6664
>>>>>      shrink_zones.constprop.0+0x24c/0xab0 mm/vmscan.c:6906
>>>>>      do_try_to_free_pages+0x150/0x880 mm/vmscan.c:6968
>>>>>
>>>>>
>>>>> The folio is deleted from lru and the folio->lru can't be accessed. If
>>>>> the folio is splitted later,
>>>>> lru_add_split_folio() assumes the folio is on lru.
>>>>
>>>> Not sure if something like the following would be appropriate:
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index b91a33fb6c694..fdd58c8ba5254 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1566,6 +1566,9 @@ int unmap_poisoned_folio(struct folio *folio,
>>>> unsigned long pfn, bool must_kill)
>>>>           enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>>>>           struct address_space *mapping;
>>>>
>>>> +       if (folio_test_large && !folio_test_hugetlb(folio))
>>>> +               return -EBUSY;
>>>> +
>>>>           if (folio_test_swapcache(folio)) {
>>>>                   pr_err("%#lx: keeping poisoned page in swap cache\n",
>>>> pfn);
>>>>                   ttu &= ~TTU_HWPOISON;
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index f8dfd2864bbf4..6a3426bc9e9d7 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1138,7 +1138,8 @@ static unsigned int shrink_folio_list(struct
>>>> list_head *folio_list,
>>>>                           goto keep;
>>>>
>>>>                   if (folio_contain_hwpoisoned_page(folio)) {
>>>> -                       unmap_poisoned_folio(folio, folio_pfn(folio),
>>>> false);
>>>> +                       if (unmap_poisoned_folio(folio,
>>>> folio_pfn(folio), false)){
>>>> +                               list_add(&folio->lru, &ret_folios);
>>>>                           folio_unlock(folio);
>>>>                           folio_put(folio);
>>>>                           continue;
>>>
>>> The expected behaviour is keeping the folio on lru if
>>> unmap_poisoned_folio fails?
>>
>> Good question, it's a mess.
>>
>> If we keep the LRU bit cleared (kept isolated), we wouldn't have to add
>> it to the list.
>>
>> But now I wonder where deferred_split_scan() would check for the LRU flag?
>>
>> It seems to trylock, to then call split_folio().
>>
>> In __folio_split(), I don't find any checks for the lru flag ... :(
>>
>> We call lru_add_split_folio() where we
>> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>
> Oh, that's for the new folio. So the checks in the other two paths apply.
>
> When we come through deferred, we don't expect a "list" and consequently assume that it is still on the LRU.
>
> 	VM_WARN_ON(!folio_test_lru(folio));
>
> So not adding it back to the LRU will be problematic for lru_add_split_folio() when called through deferred shrinking I guess ...

So __folio_split() has an implicit rule that:
1. if the given list is not NULL, the folio cannot be on LRU;
2. if the given list is NULL, the folio is on LRU.

And the rule is buried deeply in lru_add_split_folio().

Should we add some checks in __folio_split()?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..8ce2734c9ca0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);

+	if (list && folio_test_lru(folio))
+		return -EINVAL;
+	if (!list && !folio_test_lru(folio))
+		return -EINVAL;
+
 	if (folio != page_folio(split_at) || folio != page_folio(lock_at))
 		return -EINVAL;



--
Best Regards,
Yan, Zi


* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11 14:30             ` Zi Yan
@ 2025-06-11 17:34               ` David Hildenbrand
  2025-06-11 17:52                 ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-11 17:34 UTC (permalink / raw)
  To: Zi Yan; +Cc: Jinjiang Tu, akpm, linmiaohe, linux-mm, wangkefeng.wang

> So __folio_split() has an implicit rule that:
> 1. if the given list is not NULL, the folio cannot be on LRU;
> 2. if the given list is NULL, the folio is on LRU.
> 
> And the rule is buried deeply in lru_add_split_folio().
> 
> Should we add some checks in __folio_split()?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a..8ce2734c9ca0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>   	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> 
> +	if (list && folio_test_lru(folio))
> +		return -EINVAL;
> +	if (!list && !folio_test_lru(folio))
> +		return -EINVAL;
> +

I guess we currently don't run into that, because whenever a folio is 
otherwise isolated, there is an additional reference or a page table 
mapping, so it cannot get split either way (e.g., freezing the refcount 
fails).

So maybe these checks would be too early and they should happen after we 
froze the refcount?
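
Something along these lines, i.e. only checking once the refcount is frozen
and nothing can race with the LRU state anymore (illustrative placement only;
the surrounding code in __folio_split() obviously differs):

	/* Illustrative only: once the refcount is frozen, no concurrent
	 * isolation can change the LRU state under us, so it is safe to
	 * sanity-check it against 'list' here. */
	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		VM_WARN_ON_ONCE(list && folio_test_lru(folio));
		VM_WARN_ON_ONCE(!list && !folio_test_lru(folio));
		/* ... carry on with the actual split ... */
	}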

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11 17:34               ` David Hildenbrand
@ 2025-06-11 17:52                 ` Zi Yan
  2025-06-12  7:53                   ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2025-06-11 17:52 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Jinjiang Tu, akpm, linmiaohe, linux-mm, wangkefeng.wang

On 11 Jun 2025, at 13:34, David Hildenbrand wrote:

>> So __folio_split() has an implicit rule that:
>> 1. if the given list is not NULL, the folio cannot be on LRU;
>> 2. if the given list is NULL, the folio is on LRU.
>>
>> And the rule is buried deeply in lru_add_split_folio().
>>
>> Should we add some checks in __folio_split()?
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d3e66136e41a..8ce2734c9ca0 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>   	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>   	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>
>> +	if (list && folio_test_lru(folio))
>> +		return -EINVAL;
>> +	if (!list && !folio_test_lru(folio))
>> +		return -EINVAL;
>> +
>
> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>
> So maybe these checks would be too early and they should happen after we froze the refcount?

But if the caller does the isolation, the additional refcount is OK and
can_split_folio() will return true. In addition, __folio_split() does not
change folio LRU state, so these two checks are orthogonal to refcount
check, right? The placement of them does not matter, but earlier the better
to avoid unnecessary work. I see these as sanity checks for callers.


Best Regards,
Yan, Zi



* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-11 17:52                 ` Zi Yan
@ 2025-06-12  7:53                   ` David Hildenbrand
  2025-06-12 15:35                     ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-12  7:53 UTC (permalink / raw)
  To: Zi Yan; +Cc: Jinjiang Tu, akpm, linmiaohe, linux-mm, wangkefeng.wang

On 11.06.25 19:52, Zi Yan wrote:
> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
> 
>>> So __folio_split() has an implicit rule that:
>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>> 2. if the given list is NULL, the folio is on LRU.
>>>
>>> And the rule is buried deeply in lru_add_split_folio().
>>>
>>> Should we add some checks in __folio_split()?
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d3e66136e41a..8ce2734c9ca0 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>    	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>    	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>
>>> +	if (list && folio_test_lru(folio))
>>> +		return -EINVAL;
>>> +	if (!list && !folio_test_lru(folio))
>>> +		return -EINVAL;
>>> +
>>
>> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>>
>> So maybe these checks would be too early and they should happen after we froze the refcount?
> 
> But if the caller does the isolation, the additional refcount is OK and
> can_split_folio() will return true. In addition, __folio_split() does not
> change folio LRU state, so these two checks are orthogonal to refcount
> check, right? The placement of them does not matter, but earlier the better
> to avoid unnecessary work. I see these are sanity checks for callers.

In light of the discussion in this thread, if you have someone that 
takes the folio off the LRU concurrently, I think we could still run 
into a race here. Because that could happen just after we passed the 
test in __folio_split().

That's why I think the test would have to happen when there are no such 
races possible anymore.

But the real question is if it is okay to remove the folio from the LRU 
as done in the patch discussed here ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-12  7:53                   ` David Hildenbrand
@ 2025-06-12 15:35                     ` Zi Yan
  2025-06-12 15:50                       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2025-06-12 15:35 UTC (permalink / raw)
  To: David Hildenbrand, Jinjiang Tu; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang

On 12 Jun 2025, at 3:53, David Hildenbrand wrote:

> On 11.06.25 19:52, Zi Yan wrote:
>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>
>>>> So __folio_split() has an implicit rule that:
>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>
>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>
>>>> Should we add some checks in __folio_split()?
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>    	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>    	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>
>>>> +	if (list && folio_test_lru(folio))
>>>> +		return -EINVAL;
>>>> +	if (!list && !folio_test_lru(folio))
>>>> +		return -EINVAL;
>>>> +
>>>
>>> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>>>
>>> So maybe these checks would be too early and they should happen after we froze the refcount?
>>
>> But if the caller does the isolation, the additional refcount is OK and
>> can_split_folio() will return true. In addition, __folio_split() does not
>> change folio LRU state, so these two checks are orthogonal to refcount
>> check, right? The placement of them does not matter, but earlier the better
>> to avoid unnecessary work. I see these are sanity checks for callers.
>
> In light of the discussion in this thread, if you have someone that takes the folio off the LRU concurrently, I think we could still run into a race here. Because that could happen just after we passed the test in __folio_split().
>
> That's why I think the test would have to happen when there are no such races possible anymore.

Make sense. Thanks for the explanation.

>
> But the real question is if it is okay to remove the folio from the LRU as done in the patch discussed here ...

I just read through the email thread. IIUC, when deferred_split_scan() splits
a THP, it expects the THP to be on the LRU list. I think that makes sense, since
all these THPs are on both the deferred_split_queue and the LRU list.
And deferred_split_scan() uses split_folio() without providing a list
to store the after-split folios.

In terms of the patch, since unmap_poisoned_folio() does not handle large
folios, why not just split the large folios and add the after-split folios
to folio_list? Then, the while loop will go over all the after-split folios
one by one.
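
Roughly something like this in shrink_folio_list() (just a sketch of the
idea, using the existing split_folio_to_list() helper; the split can of
course still fail):

		if (folio_contain_hwpoisoned_page(folio)) {
			/*
			 * Sketch: split the large folio so the resulting
			 * folios go back onto folio_list and the loop
			 * revisits them one by one; unmap_poisoned_folio()
			 * then only ever sees order-0 folios.
			 */
			if (folio_test_large(folio) &&
			    split_folio_to_list(folio, folio_list))
				goto keep_locked;	/* split failed */

			unmap_poisoned_folio(folio, folio_pfn(folio), false);
			folio_unlock(folio);
			folio_put(folio);
			continue;
		}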

BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
memory_hotplug.c and there is no guard for large folios either. That
also needs a fix?

Best Regards,
Yan, Zi



* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-12 15:35                     ` Zi Yan
@ 2025-06-12 15:50                       ` David Hildenbrand
  2025-06-12 16:48                         ` Zi Yan
  2025-06-16 11:33                         ` Jinjiang Tu
  0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-06-12 15:50 UTC (permalink / raw)
  To: Zi Yan, Jinjiang Tu; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang

On 12.06.25 17:35, Zi Yan wrote:
> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
> 
>> On 11.06.25 19:52, Zi Yan wrote:
>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>
>>>>> So __folio_split() has an implicit rule that:
>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>
>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>
>>>>> Should we add some checks in __folio_split()?
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>     	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>     	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>
>>>>> +	if (list && folio_test_lru(folio))
>>>>> +		return -EINVAL;
>>>>> +	if (!list && !folio_test_lru(folio))
>>>>> +		return -EINVAL;
>>>>> +
>>>>
>>>> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>>>>
>>>> So maybe these checks would be too early and they should happen after we froze the refcount?
>>>
>>> But if the caller does the isolation, the additional refcount is OK and
>>> can_split_folio() will return true. In addition, __folio_split() does not
>>> change folio LRU state, so these two checks are orthogonal to refcount
>>> check, right? The placement of them does not matter, but earlier the better
>>> to avoid unnecessary work. I see these are sanity checks for callers.
>>
>> In light of the discussion in this thread, if you have someone that takes the folio off the LRU concurrently, I think we could still run into a race here. Because that could happen just after we passed the test in __folio_split().
>>
>> That's why I think the test would have to happen when there are no such races possible anymore.
> 
> Make sense. Thanks for the explanation.
> 
>>
>> But the real question is if it is okay to remove the folio from the LRU as done in the patch discussed here ...
> 
> I just read through the email thread. IIUC, when deferred_split_scan() split
> a THP, it expects the THP is on LRU list. I think it makes sense since
> all these THPs are in both the deferred_split_queue and LRU list.
> And deferred_split_scan() uses split_folio() without providing a list
> to store the after-split folios.
> 
> In terms of the patch, since unmap_poisoned_folio() does not handle large
> folios, why not just split the large folios and add the after-split folios
> to folio_list?

That's what I raised, but apparently it might not be worth it for that 
corner case (splitting might fail).

> Then, the while loop will go over all the after-split folios
> one by one.
> 
> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
> memory_hotplug.c and there is no guard for large folios either. That
> also needs a fix?

Yes, that was mentioned, and I was hoping we could let 
unmap_poisoned_folio() check+fail in that case.

-- 
Cheers,

David / dhildenb




* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-12 15:50                       ` David Hildenbrand
@ 2025-06-12 16:48                         ` Zi Yan
  2025-06-16 11:34                           ` Jinjiang Tu
  2025-06-16 11:33                         ` Jinjiang Tu
  1 sibling, 1 reply; 18+ messages in thread
From: Zi Yan @ 2025-06-12 16:48 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Jinjiang Tu, akpm, linmiaohe, linux-mm, wangkefeng.wang

On 12 Jun 2025, at 11:50, David Hildenbrand wrote:

> On 12.06.25 17:35, Zi Yan wrote:
>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>
>>> On 11.06.25 19:52, Zi Yan wrote:
>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>
>>>>>> So __folio_split() has an implicit rule that:
>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>
>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>
>>>>>> Should we add some checks in __folio_split()?
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>     	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>     	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>
>>>>>> +	if (list && folio_test_lru(folio))
>>>>>> +		return -EINVAL;
>>>>>> +	if (!list && !folio_test_lru(folio))
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>
>>>>> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>>>>>
>>>>> So maybe these checks would be too early and they should happen after we froze the refcount?
>>>>
>>>> But if the caller does the isolation, the additional refcount is OK and
>>>> can_split_folio() will return true. In addition, __folio_split() does not
>>>> change folio LRU state, so these two checks are orthogonal to refcount
>>>> check, right? The placement of them does not matter, but earlier the better
>>>> to avoid unnecessary work. I see these are sanity checks for callers.
>>>
>>> In light of the discussion in this thread, if you have someone that takes the folio off the LRU concurrently, I think we could still run into a race here. Because that could happen just after we passed the test in __folio_split().
>>>
>>> That's why I think the test would have to happen when there are no such races possible anymore.
>>
>> Make sense. Thanks for the explanation.
>>
>>>
>>> But the real question is if it is okay to remove the folio from the LRU as done in the patch discussed here ...
>>
>> I just read through the email thread. IIUC, when deferred_split_scan() split
>> a THP, it expects the THP is on LRU list. I think it makes sense since
>> all these THPs are in both the deferred_split_queue and LRU list.
>> And deferred_split_scan() uses split_folio() without providing a list
>> to store the after-split folios.
>>
>> In terms of the patch, since unmap_poisoned_folio() does not handle large
>> folios, why not just split the large folios and add the after-split folios
>> to folio_list?
>
> That's what I raised, but apparently it might not be worth it for that corner case (splitting might fail).

OK, the reason not to split is that handling split failures is too much work,
and it is better left to memory_failure().
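
For completeness, the rejected split-and-requeue alternative would look
roughly like the sketch below (untested and illustrative only, inside
shrink_folio_list(), using split_folio_to_list() with the per-call
folio_list):

		if (folio_test_large(folio)) {
			/*
			 * Split the hwpoisoned large folio in place and let
			 * the reclaim loop retry the resulting base pages one
			 * by one. The split can fail (e.g. due to extra
			 * references), and handling that failure is exactly
			 * the complexity being avoided.
			 */
			if (split_folio_to_list(folio, folio_list))
				goto keep_locked;
		}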

In terms of this patch, returning the large poisoned folio back to the LRU list
seems to be the right thing to do. My thought is that the reclaim code
is trying to free this folio, but it cannot reclaim a large poisoned folio,
so it puts the folio back, just as it would for an in-use folio it cannot reclaim.

>
>> Then, the while loop will go over all the after-split folios
>> one by one.
>>
>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>> memory_hotplug.c and there is no guard for large folios either. That
>> also needs a fix?
>
> Yes, that was mentioned, and I was hoping we could let unmap_poisoned_folio() check+fail in that case.

For this case, if unmap_poisoned_folio() fails, the whole range cannot
be offlined?

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-12 15:50                       ` David Hildenbrand
  2025-06-12 16:48                         ` Zi Yan
@ 2025-06-16 11:33                         ` Jinjiang Tu
  2025-06-16 19:27                           ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-16 11:33 UTC (permalink / raw)
  To: David Hildenbrand, Zi Yan; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang


On 2025/6/12 23:50, David Hildenbrand wrote:
> On 12.06.25 17:35, Zi Yan wrote:
>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>
>>> On 11.06.25 19:52, Zi Yan wrote:
>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>
>>>>>> So __folio_split() has an implicit rule that:
>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>
>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>
>>>>>> Should we add some checks in __folio_split()?
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>         VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>
>>>>>> +    if (list && folio_test_lru(folio))
>>>>>> +        return -EINVAL;
>>>>>> +    if (!list && !folio_test_lru(folio))
>>>>>> +        return -EINVAL;
>>>>>> +
>>>>>
>>>>> I guess we currently don't run into that, because whenever a folio 
>>>>> is otherwise isolated, there is an additional reference or a page 
>>>>> table mapping, so it cannot get split either way (e.g., freezing 
>>>>> the refcount fails).
>>>>>
>>>>> So maybe these checks would be too early and they should happen 
>>>>> after we froze the refcount?
>>>>
>>>> But if the caller does the isolation, the additional refcount is OK 
>>>> and
>>>> can_split_folio() will return true. In addition, __folio_split() 
>>>> does not
>>>> change folio LRU state, so these two checks are orthogonal to refcount
>>>> check, right? The placement of them does not matter, but earlier 
>>>> the better
>>>> to avoid unnecessary work. I see these are sanity checks for callers.
>>>
>>> In light of the discussion in this thread, if you have someone that 
>>> takes the folio off the LRU concurrently, I think we could still run 
>>> into a race here. Because that could happen just after we passed the 
>>> test in __folio_split().
>>>
>>> That's why I think the test would have to happen when there are no 
>>> such races possible anymore.
>>
>> Make sense. Thanks for the explanation.
>>
>>>
>>> But the real question is if it is okay to remove the folio from the 
>>> LRU as done in the patch discussed here ...
>>
>> I just read through the email thread. IIUC, when 
>> deferred_split_scan() split
>> a THP, it expects the THP is on LRU list. I think it makes sense since
>> all these THPs are in both the deferred_split_queue and LRU list.
>> And deferred_split_scan() uses split_folio() without providing a list
>> to store the after-split folios.
>>
>> In terms of the patch, since unmap_poisoned_folio() does not handle 
>> large
>> folios, why not just split the large folios and add the after-split 
>> folios
>> to folio_list?
>
> That's what I raised, but apparently it might not be worth it for that 
> corner case (splitting might fail).
>
>> Then, the while loop will go over all the after-split folios
>> one by one.
>>
>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>> memory_hotplug.c and there is no guard for large folios either. That
>> also needs a fix?
>
> Yes, that was mentioned, and I was hoping we could let 
> unmap_poisoned_folio() check+fail in that case.

Maybe we could fix do_migrate_range() like below:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..5a6d869e6b56 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1823,7 +1823,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
                         pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;

                 if (folio_contain_hwpoisoned_page(folio)) {
-                       if (WARN_ON(folio_test_lru(folio)))
+                       if (folio_test_large(folio))
+                               goto put_folio;
+
+                       if (folio_test_lru(folio))
                                 folio_isolate_lru(folio);
                         if (folio_mapped(folio)) {
                                 folio_lock(folio);

The folio may still be on the LRU if the folio_test_lru() check happens between
setting the hwpoison flag and isolating the folio from the LRU in memory_failure().
So I removed the WARN_ON.

Large folios are skipped before we isolate from the LRU.
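
To spell out the race I mean (an illustrative interleaving based on my
understanding, not a verified trace):

  memory_failure()                        do_migrate_range()
  ----------------                        ------------------
  sets the hwpoison flag
                                          folio_contain_hwpoisoned_page() -> true
                                          folio_test_lru() -> true (not isolated yet)
  folio_isolate_lru(folio)

So seeing the LRU flag set here is a legitimate transient state rather than
a bug, which is why the WARN_ON goes away while the isolation attempt is kept.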



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-12 16:48                         ` Zi Yan
@ 2025-06-16 11:34                           ` Jinjiang Tu
  0 siblings, 0 replies; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-16 11:34 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang


On 2025/6/13 0:48, Zi Yan wrote:
> On 12 Jun 2025, at 11:50, David Hildenbrand wrote:
>
>> On 12.06.25 17:35, Zi Yan wrote:
>>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>>
>>>> On 11.06.25 19:52, Zi Yan wrote:
>>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>>
>>>>>>> So __folio_split() has an implicit rule that:
>>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>>
>>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>>
>>>>>>> Should we add some checks in __folio_split()?
>>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>      	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>>      	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>>
>>>>>>> +	if (list && folio_test_lru(folio))
>>>>>>> +		return -EINVAL;
>>>>>>> +	if (!list && !folio_test_lru(folio))
>>>>>>> +		return -EINVAL;
>>>>>>> +
>>>>>> I guess we currently don't run into that, because whenever a folio is otherwise isolated, there is an additional reference or a page table mapping, so it cannot get split either way (e.g., freezing the refcount fails).
>>>>>>
>>>>>> So maybe these checks would be too early and they should happen after we froze the refcount?
>>>>> But if the caller does the isolation, the additional refcount is OK and
>>>>> can_split_folio() will return true. In addition, __folio_split() does not
>>>>> change folio LRU state, so these two checks are orthogonal to refcount
>>>>> check, right? The placement of them does not matter, but earlier the better
>>>>> to avoid unnecessary work. I see these are sanity checks for callers.
>>>> In light of the discussion in this thread, if you have someone that takes the folio off the LRU concurrently, I think we could still run into a race here. Because that could happen just after we passed the test in __folio_split().
>>>>
>>>> That's why I think the test would have to happen when there are no such races possible anymore.
>>> Make sense. Thanks for the explanation.
>>>
>>>> But the real question is if it is okay to remove the folio from the LRU as done in the patch discussed here ...
>>> I just read through the email thread. IIUC, when deferred_split_scan() split
>>> a THP, it expects the THP is on LRU list. I think it makes sense since
>>> all these THPs are in both the deferred_split_queue and LRU list.
>>> And deferred_split_scan() uses split_folio() without providing a list
>>> to store the after-split folios.
>>>
>>> In terms of the patch, since unmap_poisoned_folio() does not handle large
>>> folios, why not just split the large folios and add the after-split folios
>>> to folio_list?
>> That's what I raised, but apparently it might not be worth it for that corner case (splitting might fail).
> OK, the reason not to split is that handling split failures is too much work,
> and it is better left to memory_failure().
>
> In terms of this patch, returning the large poisoned folio back to the LRU list
> seems to be the right thing to do. My thought is that the reclaim code
> is trying to free this folio, but it cannot reclaim a large poisoned folio,
> so it puts the folio back, just as it would for an in-use folio it cannot reclaim.
>
>>> Then, the while loop will go over all the after-split folios
>>> one by one.
>>>
>>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>>> memory_hotplug.c and there is no guard for large folios either. That
>>> also needs a fix?
>> Yes, that was mentioned, and I was hoping we could let unmap_poisoned_folio() check+fail in that case.
> For this case, if unmap_poisoned_folio() fails, the whole range cannot
> be offlined?

If we fix do_migrate_range() like below:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..5a6d869e6b56 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1823,7 +1823,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
                         pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;

                 if (folio_contain_hwpoisoned_page(folio)) {
-                       if (WARN_ON(folio_test_lru(folio)))
+                       if (folio_test_large(folio))
+                               goto put_folio;
+
+                       if (folio_test_lru(folio))
                                 folio_isolate_lru(folio);
                         if (folio_mapped(folio)) {
                                 folio_lock(folio);


If memory_failure() calls try_to_split_thp_page() after do_migrate_range()
puts the folio, the THP can be handled by memory_failure(),
and the memory can be offlined once memory_failure() has dealt with it.

However, if memory_failure() fails to split due to the extra reference,
the whole memory range can never be offlined.
>
> Best Regards,
> Yan, Zi


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-16 11:33                         ` Jinjiang Tu
@ 2025-06-16 19:27                           ` David Hildenbrand
  2025-06-17  6:43                             ` Jinjiang Tu
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-16 19:27 UTC (permalink / raw)
  To: Jinjiang Tu, Zi Yan; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang

On 16.06.25 13:33, Jinjiang Tu wrote:
> 
On 2025/6/12 23:50, David Hildenbrand wrote:
>> On 12.06.25 17:35, Zi Yan wrote:
>>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>>
>>>> On 11.06.25 19:52, Zi Yan wrote:
>>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>>
>>>>>>> So __folio_split() has an implicit rule that:
>>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>>
>>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>>
>>>>>>> Should we add some checks in __folio_split()?
>>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>>
>>>>>>> +    if (list && folio_test_lru(folio))
>>>>>>> +        return -EINVAL;
>>>>>>> +    if (!list && !folio_test_lru(folio))
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>
>>>>>> I guess we currently don't run into that, because whenever a folio
>>>>>> is otherwise isolated, there is an additional reference or a page
>>>>>> table mapping, so it cannot get split either way (e.g., freezing
>>>>>> the refcount fails).
>>>>>>
>>>>>> So maybe these checks would be too early and they should happen
>>>>>> after we froze the refcount?
>>>>>
>>>>> But if the caller does the isolation, the additional refcount is OK
>>>>> and
>>>>> can_split_folio() will return true. In addition, __folio_split()
>>>>> does not
>>>>> change folio LRU state, so these two checks are orthogonal to refcount
>>>>> check, right? The placement of them does not matter, but earlier
>>>>> the better
>>>>> to avoid unnecessary work. I see these are sanity checks for callers.
>>>>
>>>> In light of the discussion in this thread, if you have someone that
>>>> takes the folio off the LRU concurrently, I think we could still run
>>>> into a race here. Because that could happen just after we passed the
>>>> test in __folio_split().
>>>>
>>>> That's why I think the test would have to happen when there are no
>>>> such races possible anymore.
>>>
>>> Make sense. Thanks for the explanation.
>>>
>>>>
>>>> But the real question is if it is okay to remove the folio from the
>>>> LRU as done in the patch discussed here ...
>>>
>>> I just read through the email thread. IIUC, when
>>> deferred_split_scan() split
>>> a THP, it expects the THP is on LRU list. I think it makes sense since
>>> all these THPs are in both the deferred_split_queue and LRU list.
>>> And deferred_split_scan() uses split_folio() without providing a list
>>> to store the after-split folios.
>>>
>>> In terms of the patch, since unmap_poisoned_folio() does not handle
>>> large
>>> folios, why not just split the large folios and add the after-split
>>> folios
>>> to folio_list?
>>
>> That's what I raised, but apparently it might not be worth it for that
>> corner case (splitting might fail).
>>
>>> Then, the while loop will go over all the after-split folios
>>> one by one.
>>>
>>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>>> memory_hotplug.c and there is no guard for large folios either. That
>>> also needs a fix?
>>
>> Yes, that was mentioned, and I was hoping we could let
>> unmap_poisoned_folio() check+fail in that case.
> 
> Maybe we could fix do_migrate_range() like below:
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8305483de38b..5a6d869e6b56 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1823,7 +1823,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                           pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
> 
>                   if (folio_contain_hwpoisoned_page(folio)) {
> -                       if (WARN_ON(folio_test_lru(folio)))
> +                       if (folio_test_large(folio))
> +                               goto put_folio;
> +
> +                       if (folio_test_lru(folio))
>                                   folio_isolate_lru(folio);

Hm, what is supposed to happen if we fail folio_isolate_lru()?

>                           if (folio_mapped(folio)) {
>                                   folio_lock(folio);
> 
> The folio may still be on the LRU if the folio_test_lru() check happens between
> setting the hwpoison flag and isolating the folio from the LRU in memory_failure().
> So I removed the WARN_ON.

I guess this would work, although this special-casing on large folios in 
the caller of unmap_poisoned_folio() is rather weird.

What is supposed to happen if unmap_poisoned_folio() failed for small 
folios? Why are we okay with having the LRU flag cleared and the folio 
isolated?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-16 19:27                           ` David Hildenbrand
@ 2025-06-17  6:43                             ` Jinjiang Tu
  0 siblings, 0 replies; 18+ messages in thread
From: Jinjiang Tu @ 2025-06-17  6:43 UTC (permalink / raw)
  To: David Hildenbrand, Zi Yan; +Cc: akpm, linmiaohe, linux-mm, wangkefeng.wang


On 2025/6/17 3:27, David Hildenbrand wrote:
> On 16.06.25 13:33, Jinjiang Tu wrote:
>>
>> 在 2025/6/12 23:50, David Hildenbrand 写道:
>>> On 12.06.25 17:35, Zi Yan wrote:
>>>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>>>
>>>>> On 11.06.25 19:52, Zi Yan wrote:
>>>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>>>
>>>>>>>> So __folio_split() has an implicit rule that:
>>>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>>>
>>>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>>>
>>>>>>>> Should we add some checks in __folio_split()?
>>>>>>>>
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>>>
>>>>>>>> +    if (list && folio_test_lru(folio))
>>>>>>>> +        return -EINVAL;
>>>>>>>> +    if (!list && !folio_test_lru(folio))
>>>>>>>> +        return -EINVAL;
>>>>>>>> +
>>>>>>>
>>>>>>> I guess we currently don't run into that, because whenever a folio
>>>>>>> is otherwise isolated, there is an additional reference or a page
>>>>>>> table mapping, so it cannot get split either way (e.g., freezing
>>>>>>> the refcount fails).
>>>>>>>
>>>>>>> So maybe these checks would be too early and they should happen
>>>>>>> after we froze the refcount?
>>>>>>
>>>>>> But if the caller does the isolation, the additional refcount is OK
>>>>>> and
>>>>>> can_split_folio() will return true. In addition, __folio_split()
>>>>>> does not
>>>>>> change folio LRU state, so these two checks are orthogonal to 
>>>>>> refcount
>>>>>> check, right? The placement of them does not matter, but earlier
>>>>>> the better
>>>>>> to avoid unnecessary work. I see these are sanity checks for 
>>>>>> callers.
>>>>>
>>>>> In light of the discussion in this thread, if you have someone that
>>>>> takes the folio off the LRU concurrently, I think we could still run
>>>>> into a race here. Because that could happen just after we passed the
>>>>> test in __folio_split().
>>>>>
>>>>> That's why I think the test would have to happen when there are no
>>>>> such races possible anymore.
>>>>
>>>> Make sense. Thanks for the explanation.
>>>>
>>>>>
>>>>> But the real question is if it is okay to remove the folio from the
>>>>> LRU as done in the patch discussed here ...
>>>>
>>>> I just read through the email thread. IIUC, when
>>>> deferred_split_scan() split
>>>> a THP, it expects the THP is on LRU list. I think it makes sense since
>>>> all these THPs are in both the deferred_split_queue and LRU list.
>>>> And deferred_split_scan() uses split_folio() without providing a list
>>>> to store the after-split folios.
>>>>
>>>> In terms of the patch, since unmap_poisoned_folio() does not handle
>>>> large
>>>> folios, why not just split the large folios and add the after-split
>>>> folios
>>>> to folio_list?
>>>
>>> That's what I raised, but apparently it might not be worth it for that
>>> corner case (splitting might fail).
>>>
>>>> Then, the while loop will go over all the after-split folios
>>>> one by one.
>>>>
>>>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>>>> memory_hotplug.c and there is no guard for large folios either. That
>>>> also needs a fix?
>>>
>>> Yes, that was mentioned, and I was hoping we could let
>>> unmap_poisoned_folio() check+fail in that case.
>>
>> Maybe we could fix do_migrate_range() like below:
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 8305483de38b..5a6d869e6b56 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1823,7 +1823,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>                           pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>
>>                   if (folio_contain_hwpoisoned_page(folio)) {
>> -                       if (WARN_ON(folio_test_lru(folio)))
>> +                       if (folio_test_large(folio))
>> +                               goto put_folio;
>> +
>> +                       if (folio_test_lru(folio))
>>                                   folio_isolate_lru(folio);
>
> Hm, what is supposed to happen if we fail folio_isolate_lru()?

unmap_poisoned_folio() does not seem to assume that the LRU flag has been cleared.

But the other callers of unmap_poisoned_folio() guarantee the folio is
isolated, so maybe we should be consistent with them?
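
Something like the following untested variant is what I have in mind
(illustrative only; it assumes giving up when folio_isolate_lru() fails is
acceptable, since someone else is then dealing with the folio):

                if (folio_contain_hwpoisoned_page(folio)) {
                        if (folio_test_large(folio))
                                goto put_folio;

                        /*
                         * Be consistent with the other callers: only call
                         * unmap_poisoned_folio() on an isolated folio, and
                         * give up if isolation fails.
                         */
                        if (folio_test_lru(folio) && !folio_isolate_lru(folio))
                                goto put_folio;

                        /* ... then unmap and put the folio as before ... */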

>
>>                           if (folio_mapped(folio)) {
>>                                   folio_lock(folio);
>>
>> The folio may still be on the LRU if the folio_test_lru() check happens between
>> setting the hwpoison flag and isolating the folio from the LRU in memory_failure().
>> So I removed the WARN_ON.
>
> I guess this would work, although this special-casing on large folios 
> in the caller of unmap_poisoned_folio() is rather weird.
>
> What is supposed to happen if unmap_poisoned_folio() failed for small 
> folios? Why are we okay with having the LRU flag cleared and the folio 
> isolated?

In hwpoison_user_mappings(), if unmap_poisoned_folio() fails, the small
folio is kept isolated too.

IIUC, once the folio is kept isolated, other subsystems cannot
operate on it, which avoids introducing further issues.




^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-06-17  6:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11  7:46 [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
2025-06-11  7:59 ` David Hildenbrand
2025-06-11  8:29   ` Jinjiang Tu
2025-06-11  8:35     ` David Hildenbrand
2025-06-11  9:00       ` Jinjiang Tu
2025-06-11  9:20         ` David Hildenbrand
2025-06-11  9:24           ` David Hildenbrand
2025-06-11 14:30             ` Zi Yan
2025-06-11 17:34               ` David Hildenbrand
2025-06-11 17:52                 ` Zi Yan
2025-06-12  7:53                   ` David Hildenbrand
2025-06-12 15:35                     ` Zi Yan
2025-06-12 15:50                       ` David Hildenbrand
2025-06-12 16:48                         ` Zi Yan
2025-06-16 11:34                           ` Jinjiang Tu
2025-06-16 11:33                         ` Jinjiang Tu
2025-06-16 19:27                           ` David Hildenbrand
2025-06-17  6:43                             ` Jinjiang Tu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).