linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] fix two calls of unmap_poisoned_folio() for large folio
@ 2025-06-27 12:57 Jinjiang Tu
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
  2025-06-27 12:57 ` [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range Jinjiang Tu
  0 siblings, 2 replies; 36+ messages in thread
From: Jinjiang Tu @ 2025-06-27 12:57 UTC (permalink / raw)
  To: akpm, linmiaohe, david, osalvador, mhocko
  Cc: linux-mm, wangkefeng.wang, tujinjiang

Fix calling unmap_poisoned_folio() for large folios in shrink_folio_list()
and do_migrate_range().

Changelog:
 v2:
  1) add Closes tag for patch 1
  2) add patch 2 to fix calling unmap_poisoned_folio() for large folios in
     do_migrate_range().

Jinjiang Tu (2):
  mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  mm/memory_hotplug: fix hwpoisoned large folio handling in
    do_migrate_range

 mm/memory_hotplug.c | 21 +++++++++++++--------
 mm/vmscan.c         |  8 ++++++++
 2 files changed, 21 insertions(+), 8 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 12:57 [PATCH v2 0/2] fix two calls of unmap_poisoned_folio() for large folio Jinjiang Tu
@ 2025-06-27 12:57 ` Jinjiang Tu
  2025-06-27 17:10   ` David Hildenbrand
                     ` (3 more replies)
  2025-06-27 12:57 ` [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range Jinjiang Tu
  1 sibling, 4 replies; 36+ messages in thread
From: Jinjiang Tu @ 2025-06-27 12:57 UTC (permalink / raw)
  To: akpm, linmiaohe, david, osalvador, mhocko
  Cc: linux-mm, wangkefeng.wang, tujinjiang

In shrink_folio_list(), the hwpoisoned folio may be a large folio, which
can't be handled by unmap_poisoned_folio().

Since UCEs are rare in the real world, and a race with reclamation is rarer
still, just skipping the hwpoisoned large folio is enough. memory_failure()
will handle it if the UCE is triggered again.

Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f8dfd2864bbf..424412680cfc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1138,6 +1138,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			goto keep;
 
 		if (folio_contain_hwpoisoned_page(folio)) {
+			/*
+			 * unmap_poisoned_folio() can't handle large
+			 * folio, just skip it. memory_failure() will
+			 * handle it if the UCE is triggered again.
+			 */
+			if (folio_test_large(folio))
+				goto keep_locked;
+
 			unmap_poisoned_folio(folio, folio_pfn(folio), false);
 			folio_unlock(folio);
 			folio_put(folio);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-06-27 12:57 [PATCH v2 0/2] fix two calls of unmap_poisoned_folio() for large folio Jinjiang Tu
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
@ 2025-06-27 12:57 ` Jinjiang Tu
  2025-07-01 14:21   ` Oscar Salvador
  2025-07-03  7:53   ` David Hildenbrand
  1 sibling, 2 replies; 36+ messages in thread
From: Jinjiang Tu @ 2025-06-27 12:57 UTC (permalink / raw)
  To: akpm, linmiaohe, david, osalvador, mhocko
  Cc: linux-mm, wangkefeng.wang, tujinjiang

In do_migrate_range(), the hwpoisoned folio may be a large folio, which
can't be handled by unmap_poisoned_folio().

I can reproduce this issue in qemu after adding a delay in memory_failure():

BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:try_to_unmap_one+0x16a/0xfc0
 <TASK>
 rmap_walk_anon+0xda/0x1f0
 try_to_unmap+0x78/0x80
 ? __pfx_try_to_unmap_one+0x10/0x10
 ? __pfx_folio_not_mapped+0x10/0x10
 ? __pfx_folio_lock_anon_vma_read+0x10/0x10
 unmap_poisoned_folio+0x60/0x140
 do_migrate_range+0x4d1/0x600
 ? slab_memory_callback+0x6a/0x190
 ? notifier_call_chain+0x56/0xb0
 offline_pages+0x3e6/0x460
 memory_subsys_offline+0x130/0x1f0
 device_offline+0xba/0x110
 acpi_bus_offline+0xb7/0x130
 acpi_scan_hot_remove+0x77/0x290
 acpi_device_hotplug+0x1e0/0x240
 acpi_hotplug_work_fn+0x1a/0x30
 process_one_work+0x186/0x340

In this case, just make offline_pages() fail.

Besides, do_migrate_range() may be called between memory_failure() setting the
hwpoison flag and isolating the folio from the LRU, so remove the WARN_ON().
Elsewhere, unmap_poisoned_folio() is only called once the folio has been
isolated; follow the same rule in do_migrate_range() too.

Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/memory_hotplug.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b..5fab212e3a2c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1795,7 +1795,7 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
 	return 0;
 }
 
-static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct folio *folio;
 	unsigned long pfn;
@@ -1819,8 +1819,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
 
 		if (folio_contain_hwpoisoned_page(folio)) {
-			if (WARN_ON(folio_test_lru(folio)))
-				folio_isolate_lru(folio);
+			if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+				goto err_out;
+			if (folio_test_lru(folio) && !folio_isolate_lru(folio))
+				goto err_out;
 			if (folio_mapped(folio)) {
 				folio_lock(folio);
 				unmap_poisoned_folio(folio, pfn, false);
@@ -1877,6 +1879,11 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			putback_movable_pages(&source);
 		}
 	}
+	return 0;
+err_out:
+	folio_put(folio);
+	putback_movable_pages(&source);
+	return -EBUSY;
 }
 
 static int __init cmdline_parse_movable_node(char *p)
@@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 
 			ret = scan_movable_pages(pfn, end_pfn, &pfn);
 			if (!ret) {
-				/*
-				 * TODO: fatal migration failures should bail
-				 * out
-				 */
-				do_migrate_range(pfn, end_pfn);
+				ret = do_migrate_range(pfn, end_pfn);
+				if (ret)
+					break;
 			}
 		} while (!ret);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
@ 2025-06-27 17:10   ` David Hildenbrand
  2025-06-27 22:00   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-06-27 17:10 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe, osalvador, mhocko; +Cc: linux-mm, wangkefeng.wang

On 27.06.25 14:57, Jinjiang Tu wrote:
> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> Since UCE is rare in real world, and race with reclaimation is more rare,
> just skipping the hwpoisoned large folio is enough. memory_failure() will
> handle it if the UCE is triggered again.
> 
> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>   mm/vmscan.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f8dfd2864bbf..424412680cfc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1138,6 +1138,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			goto keep;
>   
>   		if (folio_contain_hwpoisoned_page(folio)) {
> +			/*
> +			 * unmap_poisoned_folio() can't handle large
> +			 * folio, just skip it. memory_failure() will
> +			 * handle it if the UCE is triggered again.
> +			 */
> +			if (folio_test_large(folio))
> +				goto keep_locked;
> +
>   			unmap_poisoned_folio(folio, folio_pfn(folio), false);
>   			folio_unlock(folio);
>   			folio_put(folio);

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
  2025-06-27 17:10   ` David Hildenbrand
@ 2025-06-27 22:00   ` Andrew Morton
  2025-06-28  2:38     ` Jinjiang Tu
  2025-06-28  3:13   ` Miaohe Lin
  2025-07-01 14:13   ` Oscar Salvador
  3 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2025-06-27 22:00 UTC (permalink / raw)
  To: Jinjiang Tu
  Cc: linmiaohe, david, osalvador, mhocko, linux-mm, wangkefeng.wang

On Fri, 27 Jun 2025 20:57:46 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:

> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> Since UCE is rare in real world, and race with reclaimation is more rare,
> just skipping the hwpoisoned large folio is enough. memory_failure() will
> handle it if the UCE is triggered again.
> 
> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")

Thanks.  Can you please fully describe the userspace-visible effects of
this bug?

Depending on that information, do you recommend that this fix be
backported into -stable kernels?



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 22:00   ` Andrew Morton
@ 2025-06-28  2:38     ` Jinjiang Tu
  0 siblings, 0 replies; 36+ messages in thread
From: Jinjiang Tu @ 2025-06-28  2:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linmiaohe, david, osalvador, mhocko, linux-mm, wangkefeng.wang


On 2025/6/28 6:00, Andrew Morton wrote:
> On Fri, 27 Jun 2025 20:57:46 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
>
>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>> can't be handled by unmap_poisoned_folio().
>>
>> Since UCE is rare in real world, and race with reclaimation is more rare,
>> just skipping the hwpoisoned large folio is enough. memory_failure() will
>> handle it if the UCE is triggered again.
>>
>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
> Thanks.  Can you please fully describe the userspace-visible effects of
> this bug?
This happens when memory reclamation of a large folio races with
memory_failure(), and it leads to a kernel panic. The race is as follows:

cpu0                           cpu1
shrink_folio_list()            memory_failure()
                                 TestSetPageHWPoison
  unmap_poisoned_folio()
    --> triggers BUG_ON because
        unmap_poisoned_folio()
        can't handle large folios
> Depending on that information, do you recommend that this fix be
> backported into -stable kernels?
The fixed commit 1b0449544c64 has been backported to -stable, so this
fix should be backported too.
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
  2025-06-27 17:10   ` David Hildenbrand
  2025-06-27 22:00   ` Andrew Morton
@ 2025-06-28  3:13   ` Miaohe Lin
  2025-07-01 14:13   ` Oscar Salvador
  3 siblings, 0 replies; 36+ messages in thread
From: Miaohe Lin @ 2025-06-28  3:13 UTC (permalink / raw)
  To: Jinjiang Tu; +Cc: linux-mm, wangkefeng.wang, akpm, david, osalvador, mhocko

On 2025/6/27 20:57, Jinjiang Tu wrote:
> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> Since UCE is rare in real world, and race with reclaimation is more rare,
> just skipping the hwpoisoned large folio is enough. memory_failure() will
> handle it if the UCE is triggered again.
> 
> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>

Thanks for your patch.

> ---
>  mm/vmscan.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f8dfd2864bbf..424412680cfc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1138,6 +1138,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			goto keep;
>  
>  		if (folio_contain_hwpoisoned_page(folio)) {
> +			/*
> +			 * unmap_poisoned_folio() can't handle large
> +			 * folio, just skip it. memory_failure() will
> +			 * handle it if the UCE is triggered again.
> +			 */
> +			if (folio_test_large(folio))
> +				goto keep_locked;
> +

I think this should work. The hwpoisoned folio is kept on the LRU list so it can be split later,
while the hwpoisoned flags for the folio and its sub-pages remain correct. And folio_mc_copy() will handle
re-accessing the hwpoisoned page when one of its callers, alloc_contig_range_noprof(), calls migrate_pages().

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.
.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
                     ` (2 preceding siblings ...)
  2025-06-28  3:13   ` Miaohe Lin
@ 2025-07-01 14:13   ` Oscar Salvador
  2025-07-03  7:30     ` Jinjiang Tu
  3 siblings, 1 reply; 36+ messages in thread
From: Oscar Salvador @ 2025-07-01 14:13 UTC (permalink / raw)
  To: Jinjiang Tu; +Cc: akpm, linmiaohe, david, mhocko, linux-mm, wangkefeng.wang

On Fri, Jun 27, 2025 at 08:57:46PM +0800, Jinjiang Tu wrote:
> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().

Uhm, why can't it be handled? I think we should spell it out, and maybe
even put a comment above unmap_poisoned_folio().

> Since UCE is rare in real world, and race with reclaimation is more rare,
> just skipping the hwpoisoned large folio is enough. memory_failure() will
> handle it if the UCE is triggered again.
> 
> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>

With the above addressed:

Acked-by: Oscar Salvador <osalvador@suse.de>

 
-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-06-27 12:57 ` [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range Jinjiang Tu
@ 2025-07-01 14:21   ` Oscar Salvador
  2025-07-03  7:46     ` Jinjiang Tu
  2025-07-03  7:53   ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Oscar Salvador @ 2025-07-01 14:21 UTC (permalink / raw)
  To: Jinjiang Tu; +Cc: akpm, linmiaohe, david, mhocko, linux-mm, wangkefeng.wang

On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
> In do_migrate_range(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> I can reproduce this issue in qemu after adding delay in memory_failure()
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>  <TASK>
>  rmap_walk_anon+0xda/0x1f0
>  try_to_unmap+0x78/0x80
>  ? __pfx_try_to_unmap_one+0x10/0x10
>  ? __pfx_folio_not_mapped+0x10/0x10
>  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>  unmap_poisoned_folio+0x60/0x140
>  do_migrate_range+0x4d1/0x600
>  ? slab_memory_callback+0x6a/0x190
>  ? notifier_call_chain+0x56/0xb0
>  offline_pages+0x3e6/0x460
>  memory_subsys_offline+0x130/0x1f0
>  device_offline+0xba/0x110
>  acpi_bus_offline+0xb7/0x130
>  acpi_scan_hot_remove+0x77/0x290
>  acpi_device_hotplug+0x1e0/0x240
>  acpi_hotplug_work_fn+0x1a/0x30
>  process_one_work+0x186/0x340
> 
> In this case, just make offline_pages() fail.
> 
> Besides, do_migrate_range() may be called between memory_failure set
> hwposion flag and ioslate the folio from lru, so remove WARN_ON(). In other
> places, unmap_poisoned_folio() is called when the folio is isolated, obey
> it in do_migrate_range() too.
> 
> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
...
> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  
>  			ret = scan_movable_pages(pfn, end_pfn, &pfn);
>  			if (!ret) {
> -				/*
> -				 * TODO: fatal migration failures should bail
> -				 * out
> -				 */
> -				do_migrate_range(pfn, end_pfn);
> +				ret = do_migrate_range(pfn, end_pfn);
> +				if (ret)
> +					break;

I am not really sure about this one.
I get the reason you're adding it, but note that migrate_pages() can also return
"fatal" errors and we don't propagate those.

The motto has always been to migrate as much as possible, and this changes that
behaviour.
 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
  2025-07-01 14:13   ` Oscar Salvador
@ 2025-07-03  7:30     ` Jinjiang Tu
  0 siblings, 0 replies; 36+ messages in thread
From: Jinjiang Tu @ 2025-07-03  7:30 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, linmiaohe, david, mhocko, linux-mm, wangkefeng.wang


On 2025/7/1 22:13, Oscar Salvador wrote:
> On Fri, Jun 27, 2025 at 08:57:46PM +0800, Jinjiang Tu wrote:
>> In shrink_folio_list(), the hwpoisoned folio may be large folio, which
>> can't be handled by unmap_poisoned_folio().
> Uhm, why can't be handled? I think we should spell it out, and maybe
> even put a comment above unmap_poisoned_folio.
This was discussed in
https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
I will add more comments to explain it in the patch.

Thanks.
>> Since UCE is rare in real world, and race with reclaimation is more rare,
>> just skipping the hwpoisoned large folio is enough. memory_failure() will
>> handle it if the UCE is triggered again.
>>
>> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio")
>> Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> With the above addressed:
>
> Acked-by: Oscar Salvador <osalvador@suse.de>
>
>   


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-01 14:21   ` Oscar Salvador
@ 2025-07-03  7:46     ` Jinjiang Tu
  2025-07-03  7:57       ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Jinjiang Tu @ 2025-07-03  7:46 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, linmiaohe, david, mhocko, linux-mm, wangkefeng.wang


On 2025/7/1 22:21, Oscar Salvador wrote:
> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>> In do_migrate_range(), the hwpoisoned folio may be large folio, which
>> can't be handled by unmap_poisoned_folio().
>>
>> I can reproduce this issue in qemu after adding delay in memory_failure()
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>   <TASK>
>>   rmap_walk_anon+0xda/0x1f0
>>   try_to_unmap+0x78/0x80
>>   ? __pfx_try_to_unmap_one+0x10/0x10
>>   ? __pfx_folio_not_mapped+0x10/0x10
>>   ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>   unmap_poisoned_folio+0x60/0x140
>>   do_migrate_range+0x4d1/0x600
>>   ? slab_memory_callback+0x6a/0x190
>>   ? notifier_call_chain+0x56/0xb0
>>   offline_pages+0x3e6/0x460
>>   memory_subsys_offline+0x130/0x1f0
>>   device_offline+0xba/0x110
>>   acpi_bus_offline+0xb7/0x130
>>   acpi_scan_hot_remove+0x77/0x290
>>   acpi_device_hotplug+0x1e0/0x240
>>   acpi_hotplug_work_fn+0x1a/0x30
>>   process_one_work+0x186/0x340
>>
>> In this case, just make offline_pages() fail.
>>
>> Besides, do_migrate_range() may be called between memory_failure set
>> hwposion flag and ioslate the folio from lru, so remove WARN_ON(). In other
>> places, unmap_poisoned_folio() is called when the folio is isolated, obey
>> it in do_migrate_range() too.
>>
>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ...
>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>   
>>   			ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>   			if (!ret) {
>> -				/*
>> -				 * TODO: fatal migration failures should bail
>> -				 * out
>> -				 */
>> -				do_migrate_range(pfn, end_pfn);
>> +				ret = do_migrate_range(pfn, end_pfn);
>> +				if (ret)
>> +					break;
> I am not really sure about this one.
> I get the reason you're adding it, but note that migrate_pages() can also return
> "fatal" errors and we don't propagate that.
>
> The moto has always been to migrate as much as possible, and this changes this
> behaviour.
>   
If we just skip to the next pfn, offline_pages() will loop forever,
pointlessly, until it receives a signal.
It seems there is no documentation guaranteeing that memory offlining has
to migrate as much as possible.
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-06-27 12:57 ` [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range Jinjiang Tu
  2025-07-01 14:21   ` Oscar Salvador
@ 2025-07-03  7:53   ` David Hildenbrand
  1 sibling, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-07-03  7:53 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, linmiaohe, osalvador, mhocko; +Cc: linux-mm, wangkefeng.wang

On 27.06.25 14:57, Jinjiang Tu wrote:
> In do_migrate_range(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> I can reproduce this issue in qemu after adding delay in memory_failure()
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>   <TASK>
>   rmap_walk_anon+0xda/0x1f0
>   try_to_unmap+0x78/0x80
>   ? __pfx_try_to_unmap_one+0x10/0x10
>   ? __pfx_folio_not_mapped+0x10/0x10
>   ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>   unmap_poisoned_folio+0x60/0x140
>   do_migrate_range+0x4d1/0x600
>   ? slab_memory_callback+0x6a/0x190
>   ? notifier_call_chain+0x56/0xb0
>   offline_pages+0x3e6/0x460
>   memory_subsys_offline+0x130/0x1f0
>   device_offline+0xba/0x110
>   acpi_bus_offline+0xb7/0x130
>   acpi_scan_hot_remove+0x77/0x290
>   acpi_device_hotplug+0x1e0/0x240
>   acpi_hotplug_work_fn+0x1a/0x30
>   process_one_work+0x186/0x340
> 
> In this case, just make offline_pages() fail.
> 
> Besides, do_migrate_range() may be called between memory_failure set
> hwposion flag and ioslate the folio from lru, so remove WARN_ON(). In other
> places, unmap_poisoned_folio() is called when the folio is isolated, obey
> it in do_migrate_range() too.
> 
> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>   mm/memory_hotplug.c | 21 +++++++++++++--------
>   1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b1caedbade5b..5fab212e3a2c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1795,7 +1795,7 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
>   	return 0;
>   }
>   
> -static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> +static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>   {
>   	struct folio *folio;
>   	unsigned long pfn;
> @@ -1819,8 +1819,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>   			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>   
>   		if (folio_contain_hwpoisoned_page(folio)) {
> -			if (WARN_ON(folio_test_lru(folio)))
> -				folio_isolate_lru(folio);
> +			if (folio_test_large(folio) && !folio_test_hugetlb(folio))
> +				goto err_out;
> +			if (folio_test_lru(folio) && !folio_isolate_lru(folio))
> +				goto err_out;
>   			if (folio_mapped(folio)) {
>   				folio_lock(folio);
>   				unmap_poisoned_folio(folio, pfn, false);
> @@ -1877,6 +1879,11 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>   			putback_movable_pages(&source);
>   		}
>   	}
> +	return 0;
> +err_out:
> +	folio_put(folio);
> +	putback_movable_pages(&source);
> +	return -EBUSY;
>   }
 >   >   static int __init cmdline_parse_movable_node(char *p)
> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>   
>   			ret = scan_movable_pages(pfn, end_pfn, &pfn);
>   			if (!ret) {
> -				/*
> -				 * TODO: fatal migration failures should bail
> -				 * out
> -				 */
> -				do_migrate_range(pfn, end_pfn);
> +				ret = do_migrate_range(pfn, end_pfn);
> +				if (ret)
> +					break;
>   			}
>   		} while (!ret);
>   


I think this could simply be

...
	ret = scan_movable_pages(pfn, end_pfn, &pfn);
	if (!ret)
		ret = do_migrate_range(pfn, end_pfn);
} while (!ret);

because the while (!ret) will take care of the "break".
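
To illustrate just the control flow, here is a minimal standalone model (not
kernel code; the *_model() helpers below are made up) showing that the loop
already stops as soon as the migrate step reports an error:

#include <errno.h>
#include <stdio.h>

/* Stand-in for scan_movable_pages(): pretend movable pages exist for two rounds. */
static int scan_movable_pages_model(int round)
{
	return round < 2 ? 0 : -ENOENT;
}

/* Stand-in for do_migrate_range(): fail "fatally" in the second round. */
static int do_migrate_range_model(int round)
{
	return round == 1 ? -EBUSY : 0;
}

int main(void)
{
	int round = 0, ret;

	do {
		ret = scan_movable_pages_model(round);
		if (!ret)
			ret = do_migrate_range_model(round);
		round++;
	} while (!ret);

	/* Prints "stopped after 2 rounds, ret=-16": the loop condition
	 * already terminates on the propagated error, no break needed. */
	printf("stopped after %d rounds, ret=%d\n", round, ret);
	return 0;
}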

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-03  7:46     ` Jinjiang Tu
@ 2025-07-03  7:57       ` David Hildenbrand
  2025-07-03  8:24         ` Jinjiang Tu
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-03  7:57 UTC (permalink / raw)
  To: Jinjiang Tu, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang

On 03.07.25 09:46, Jinjiang Tu wrote:
> 
> 在 2025/7/1 22:21, Oscar Salvador 写道:
>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>> In do_migrate_range(), the hwpoisoned folio may be large folio, which
>>> can't be handled by unmap_poisoned_folio().
>>>
>>> I can reproduce this issue in qemu after adding delay in memory_failure()
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>    <TASK>
>>>    rmap_walk_anon+0xda/0x1f0
>>>    try_to_unmap+0x78/0x80
>>>    ? __pfx_try_to_unmap_one+0x10/0x10
>>>    ? __pfx_folio_not_mapped+0x10/0x10
>>>    ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>    unmap_poisoned_folio+0x60/0x140
>>>    do_migrate_range+0x4d1/0x600
>>>    ? slab_memory_callback+0x6a/0x190
>>>    ? notifier_call_chain+0x56/0xb0
>>>    offline_pages+0x3e6/0x460
>>>    memory_subsys_offline+0x130/0x1f0
>>>    device_offline+0xba/0x110
>>>    acpi_bus_offline+0xb7/0x130
>>>    acpi_scan_hot_remove+0x77/0x290
>>>    acpi_device_hotplug+0x1e0/0x240
>>>    acpi_hotplug_work_fn+0x1a/0x30
>>>    process_one_work+0x186/0x340
>>>
>>> In this case, just make offline_pages() fail.
>>>
>>> Besides, do_migrate_range() may be called between memory_failure set
>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON(). In other
>>> places, unmap_poisoned_folio() is called when the folio is isolated, obey
>>> it in do_migrate_range() too.
>>>
>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>> ...
>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>>    
>>>    			ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>    			if (!ret) {
>>> -				/*
>>> -				 * TODO: fatal migration failures should bail
>>> -				 * out
>>> -				 */
>>> -				do_migrate_range(pfn, end_pfn);
>>> +				ret = do_migrate_range(pfn, end_pfn);
>>> +				if (ret)
>>> +					break;
>> I am not really sure about this one.
>> I get the reason you're adding it, but note that migrate_pages() can also return
>> "fatal" errors and we don't propagate that.
>>
>> The moto has always been to migrate as much as possible, and this changes this
>> behaviour.
>>    
> If we just skip to next pfn, offline_pages() will deadloop meaningless
> util received signal.

Yeah, that's also not good.

> It seems there is no document to guarantee memory offline have to
> migrate as much as possible.

We should try to offline as well as possible. But if there is something
we just cannot possibly migrate, there is no sense in retrying.

Now, could we run into this case here because we are racing with other
code, so that actually retrying again could make it work?

Remind me again: how exactly do we arrive at this point of having a
large folio that is hwpoisoned but still mapped?

In memory_failure(), we do the following on a large folio:

1) folio_set_has_hwpoisoned
2) try_to_split_thp_page
3) if splitting fails, kill_procs_now

So given that, couldn't we just retry the migration until the race is
over and we are good?
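
To pin down that sequence, a rough standalone sketch (the names come from the
list above; the bodies are stubs, not the real implementations):

#include <stdbool.h>
#include <stdio.h>

static void folio_set_has_hwpoisoned_model(void)
{
	printf("1) mark the large folio as containing a hwpoisoned page\n");
}

static bool try_to_split_thp_page_model(bool extra_ref_held)
{
	/* Splitting can fail, e.g. when someone else holds a reference. */
	return !extra_ref_held;
}

static void kill_procs_now_model(void)
{
	printf("3) split failed: kill the collected processes\n");
}

int main(void)
{
	bool extra_ref_held = true;	/* the racy case discussed here */

	folio_set_has_hwpoisoned_model();
	if (try_to_split_thp_page_model(extra_ref_held))
		printf("2) split succeeded: handle the single poisoned page\n");
	else
		kill_procs_now_model();
	return 0;
}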

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-03  7:57       ` David Hildenbrand
@ 2025-07-03  8:24         ` Jinjiang Tu
  2025-07-03  9:06           ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Jinjiang Tu @ 2025-07-03  8:24 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang


On 2025/7/3 15:57, David Hildenbrand wrote:
> On 03.07.25 09:46, Jinjiang Tu wrote:
>>
>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>> In do_migrate_range(), the hwpoisoned folio may be large folio, which
>>>> can't be handled by unmap_poisoned_folio().
>>>>
>>>> I can reproduce this issue in qemu after adding delay in 
>>>> memory_failure()
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>    <TASK>
>>>>    rmap_walk_anon+0xda/0x1f0
>>>>    try_to_unmap+0x78/0x80
>>>>    ? __pfx_try_to_unmap_one+0x10/0x10
>>>>    ? __pfx_folio_not_mapped+0x10/0x10
>>>>    ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>    unmap_poisoned_folio+0x60/0x140
>>>>    do_migrate_range+0x4d1/0x600
>>>>    ? slab_memory_callback+0x6a/0x190
>>>>    ? notifier_call_chain+0x56/0xb0
>>>>    offline_pages+0x3e6/0x460
>>>>    memory_subsys_offline+0x130/0x1f0
>>>>    device_offline+0xba/0x110
>>>>    acpi_bus_offline+0xb7/0x130
>>>>    acpi_scan_hot_remove+0x77/0x290
>>>>    acpi_device_hotplug+0x1e0/0x240
>>>>    acpi_hotplug_work_fn+0x1a/0x30
>>>>    process_one_work+0x186/0x340
>>>>
>>>> In this case, just make offline_pages() fail.
>>>>
>>>> Besides, do_migrate_range() may be called between memory_failure set
>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON(). 
>>>> In other
>>>> places, unmap_poisoned_folio() is called when the folio is 
>>>> isolated, obey
>>>> it in do_migrate_range() too.
>>>>
>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned 
>>>> pages to be offlined")
>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ...
>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn, 
>>>> unsigned long nr_pages,
>>>>                   ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>                if (!ret) {
>>>> -                /*
>>>> -                 * TODO: fatal migration failures should bail
>>>> -                 * out
>>>> -                 */
>>>> -                do_migrate_range(pfn, end_pfn);
>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>> +                if (ret)
>>>> +                    break;
>>> I am not really sure about this one.
>>> I get the reason you're adding it, but note that migrate_pages() can 
>>> also return
>>> "fatal" errors and we don't propagate that.
>>>
>>> The moto has always been to migrate as much as possible, and this 
>>> changes this
>>> behaviour.
>> If we just skip to next pfn, offline_pages() will deadloop meaningless
>> util received signal.
>
> Yeah, that's also not good,
>
>> It seems there is no document to guarantee memory offline have to
>> migrate as much as possible.
>
> We should try offlining as good as possible. But if there is something 
> we just cannot possibly migrate, there is no sense in retrying.
>
> Now, could we run into this case here because we are racing with other 
> code, and actually retrying again could make it work?
>
> Remind me again: how exactly do we arrive at this point of having a 
> large folio that is hwpoisoned but still mapped?
>
> In memory_failure(), we do on a  large folio
>
> 1) folio_set_has_hwpoisoned
> 2) try_to_split_thp_page
> 3) if splitting fails, kill_procs_now
If 2) is executed after do_migrate_range() has incremented the refcount of
the folio, the split fails, and retrying is meaningless.
Since we don't know whether memory_failure() has finished, we can't know
when to abort the retry.
>
> So given that, couldn't we just retry the migration until the race is 
> over and we are good?
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-03  8:24         ` Jinjiang Tu
@ 2025-07-03  9:06           ` David Hildenbrand
  2025-07-07 11:51             ` Jinjiang Tu
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-03  9:06 UTC (permalink / raw)
  To: Jinjiang Tu, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang

On 03.07.25 10:24, Jinjiang Tu wrote:
> 
> 在 2025/7/3 15:57, David Hildenbrand 写道:
>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>
>>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio, which
>>>>> can't be handled by unmap_poisoned_folio().
>>>>>
>>>>> I can reproduce this issue in qemu after adding delay in
>>>>> memory_failure()
>>>>>
>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>     <TASK>
>>>>>     rmap_walk_anon+0xda/0x1f0
>>>>>     try_to_unmap+0x78/0x80
>>>>>     ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>     ? __pfx_folio_not_mapped+0x10/0x10
>>>>>     ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>     unmap_poisoned_folio+0x60/0x140
>>>>>     do_migrate_range+0x4d1/0x600
>>>>>     ? slab_memory_callback+0x6a/0x190
>>>>>     ? notifier_call_chain+0x56/0xb0
>>>>>     offline_pages+0x3e6/0x460
>>>>>     memory_subsys_offline+0x130/0x1f0
>>>>>     device_offline+0xba/0x110
>>>>>     acpi_bus_offline+0xb7/0x130
>>>>>     acpi_scan_hot_remove+0x77/0x290
>>>>>     acpi_device_hotplug+0x1e0/0x240
>>>>>     acpi_hotplug_work_fn+0x1a/0x30
>>>>>     process_one_work+0x186/0x340
>>>>>
>>>>> In this case, just make offline_pages() fail.
>>>>>
>>>>> Besides, do_migrate_range() may be called between memory_failure set
>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>> In other
>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>> isolated, obey
>>>>> it in do_migrate_range() too.
>>>>>
>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>> pages to be offlined")
>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>> ...
>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>> unsigned long nr_pages,
>>>>>                    ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>                 if (!ret) {
>>>>> -                /*
>>>>> -                 * TODO: fatal migration failures should bail
>>>>> -                 * out
>>>>> -                 */
>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>> +                if (ret)
>>>>> +                    break;
>>>> I am not really sure about this one.
>>>> I get the reason you're adding it, but note that migrate_pages() can
>>>> also return
>>>> "fatal" errors and we don't propagate that.
>>>>
>>>> The moto has always been to migrate as much as possible, and this
>>>> changes this
>>>> behaviour.
>>> If we just skip to next pfn, offline_pages() will deadloop meaningless
>>> util received signal.
>>
>> Yeah, that's also not good,
>>
>>> It seems there is no document to guarantee memory offline have to
>>> migrate as much as possible.
>>
>> We should try offlining as good as possible. But if there is something
>> we just cannot possibly migrate, there is no sense in retrying.
>>
>> Now, could we run into this case here because we are racing with other
>> code, and actually retrying again could make it work?
>>
>> Remind me again: how exactly do we arrive at this point of having a
>> large folio that is hwpoisoned but still mapped?
>>
>> In memory_failure(), we do on a  large folio
>>
>> 1) folio_set_has_hwpoisoned
>> 2) try_to_split_thp_page
>> 3) if splitting fails, kill_procs_now
> If 2) is executed when do_migrate_range() increment the refcount of the
> folio, the split fails, and retry is meaningless.

kill_procs_now will kill all processes, effectively unmapping the folio 
in that case?

So retrying would later just ... get us an unmapped folio and we can 
make progress?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-03  9:06           ` David Hildenbrand
@ 2025-07-07 11:51             ` Jinjiang Tu
  2025-07-07 12:37               ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Jinjiang Tu @ 2025-07-07 11:51 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang


On 2025/7/3 17:06, David Hildenbrand wrote:
> On 03.07.25 10:24, Jinjiang Tu wrote:
>>
>> 在 2025/7/3 15:57, David Hildenbrand 写道:
>>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>>
>>>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio, 
>>>>>> which
>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>
>>>>>> I can reproduce this issue in qemu after adding delay in
>>>>>> memory_failure()
>>>>>>
>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>>     <TASK>
>>>>>>     rmap_walk_anon+0xda/0x1f0
>>>>>>     try_to_unmap+0x78/0x80
>>>>>>     ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>>     ? __pfx_folio_not_mapped+0x10/0x10
>>>>>>     ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>>     unmap_poisoned_folio+0x60/0x140
>>>>>>     do_migrate_range+0x4d1/0x600
>>>>>>     ? slab_memory_callback+0x6a/0x190
>>>>>>     ? notifier_call_chain+0x56/0xb0
>>>>>>     offline_pages+0x3e6/0x460
>>>>>>     memory_subsys_offline+0x130/0x1f0
>>>>>>     device_offline+0xba/0x110
>>>>>>     acpi_bus_offline+0xb7/0x130
>>>>>>     acpi_scan_hot_remove+0x77/0x290
>>>>>>     acpi_device_hotplug+0x1e0/0x240
>>>>>>     acpi_hotplug_work_fn+0x1a/0x30
>>>>>>     process_one_work+0x186/0x340
>>>>>>
>>>>>> In this case, just make offline_pages() fail.
>>>>>>
>>>>>> Besides, do_migrate_range() may be called between memory_failure set
>>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>>> In other
>>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>>> isolated, obey
>>>>>> it in do_migrate_range() too.
>>>>>>
>>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>>> pages to be offlined")
>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>> ...
>>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>>> unsigned long nr_pages,
>>>>>>                    ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>>                 if (!ret) {
>>>>>> -                /*
>>>>>> -                 * TODO: fatal migration failures should bail
>>>>>> -                 * out
>>>>>> -                 */
>>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>>> +                if (ret)
>>>>>> +                    break;
>>>>> I am not really sure about this one.
>>>>> I get the reason you're adding it, but note that migrate_pages() can
>>>>> also return
>>>>> "fatal" errors and we don't propagate that.
>>>>>
>>>>> The moto has always been to migrate as much as possible, and this
>>>>> changes this
>>>>> behaviour.
>>>> If we just skip to next pfn, offline_pages() will deadloop meaningless
>>>> util received signal.
>>>
>>> Yeah, that's also not good,
>>>
>>>> It seems there is no document to guarantee memory offline have to
>>>> migrate as much as possible.
>>>
>>> We should try offlining as good as possible. But if there is something
>>> we just cannot possibly migrate, there is no sense in retrying.
>>>
>>> Now, could we run into this case here because we are racing with other
>>> code, and actually retrying again could make it work?
>>>
>>> Remind me again: how exactly do we arrive at this point of having a
>>> large folio that is hwpoisoned but still mapped?
>>>
>>> In memory_failure(), we do on a  large folio
>>>
>>> 1) folio_set_has_hwpoisoned
>>> 2) try_to_split_thp_page
>>> 3) if splitting fails, kill_procs_now
>> If 2) is executed when do_migrate_range() increment the refcount of the
>> folio, the split fails, and retry is meaningless.
>
> kill_procs_now will kill all processes, effectively unmapping the 
> folio in that case?
>
> So retrying would later just ... get us an unmapped folio and we can 
> make progress?
>
kill_procs_now()->collect_procs() collects the tasks to kill. But not
all tasks that map the folio will be collected:
collect_procs_anon()->task_early_kill()->find_early_kill_thread() will not
select a task (other than current) if PF_MCE_PROCESS isn't set and
sysctl_memory_failure_early_kill isn't enabled (which is the default
behaviour).
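
To make the default behaviour concrete, here is a simplified standalone model
of that selection rule (not the real kernel code; find_early_kill_thread()
also looks at per-thread early-kill state, which is ignored here):

#include <stdbool.h>
#include <stdio.h>

struct task_model {
	const char *name;
	bool is_current;      /* the task running memory_failure() */
	bool pf_mce_process;  /* PF_MCE_PROCESS set for this task */
};

/* Would this task be collected for killing (and thus get unmapped)? */
static bool early_kill_selects(const struct task_model *t, bool sysctl_early_kill)
{
	if (t->is_current)
		return true;            /* current is always considered */
	if (t->pf_mce_process)
		return true;            /* task opted in explicitly */
	return sysctl_early_kill;       /* otherwise only with the sysctl set */
}

int main(void)
{
	const struct task_model tasks[] = {
		{ "current",         true,  false },
		{ "other (default)", false, false },   /* maps the folio, but is skipped */
		{ "other (opt-in)",  false, true  },
	};
	const bool sysctl_early_kill = false;          /* the default */

	for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
		printf("%-16s selected=%d\n", tasks[i].name,
		       (int)early_kill_selects(&tasks[i], sysctl_early_kill));
	return 0;
}

With the defaults, the second task keeps its mapping, which is why the large
folio can stay mapped after a failed split.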


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-07 11:51             ` Jinjiang Tu
@ 2025-07-07 12:37               ` David Hildenbrand
  2025-07-08  1:15                 ` Jinjiang Tu
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-07 12:37 UTC (permalink / raw)
  To: Jinjiang Tu, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang

On 07.07.25 13:51, Jinjiang Tu wrote:
> 
> 在 2025/7/3 17:06, David Hildenbrand 写道:
>> On 03.07.25 10:24, Jinjiang Tu wrote:
>>>
>>> 在 2025/7/3 15:57, David Hildenbrand 写道:
>>>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>>>
>>>>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio,
>>>>>>> which
>>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>>
>>>>>>> I can reproduce this issue in qemu after adding delay in
>>>>>>> memory_failure()
>>>>>>>
>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>>>      <TASK>
>>>>>>>      rmap_walk_anon+0xda/0x1f0
>>>>>>>      try_to_unmap+0x78/0x80
>>>>>>>      ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>>>      ? __pfx_folio_not_mapped+0x10/0x10
>>>>>>>      ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>>>      unmap_poisoned_folio+0x60/0x140
>>>>>>>      do_migrate_range+0x4d1/0x600
>>>>>>>      ? slab_memory_callback+0x6a/0x190
>>>>>>>      ? notifier_call_chain+0x56/0xb0
>>>>>>>      offline_pages+0x3e6/0x460
>>>>>>>      memory_subsys_offline+0x130/0x1f0
>>>>>>>      device_offline+0xba/0x110
>>>>>>>      acpi_bus_offline+0xb7/0x130
>>>>>>>      acpi_scan_hot_remove+0x77/0x290
>>>>>>>      acpi_device_hotplug+0x1e0/0x240
>>>>>>>      acpi_hotplug_work_fn+0x1a/0x30
>>>>>>>      process_one_work+0x186/0x340
>>>>>>>
>>>>>>> In this case, just make offline_pages() fail.
>>>>>>>
>>>>>>> Besides, do_migrate_range() may be called between memory_failure set
>>>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>>>> In other
>>>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>>>> isolated, obey
>>>>>>> it in do_migrate_range() too.
>>>>>>>
>>>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>>>> pages to be offlined")
>>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>> ...
>>>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>>>> unsigned long nr_pages,
>>>>>>>                     ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>>>                  if (!ret) {
>>>>>>> -                /*
>>>>>>> -                 * TODO: fatal migration failures should bail
>>>>>>> -                 * out
>>>>>>> -                 */
>>>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>>>> +                if (ret)
>>>>>>> +                    break;
>>>>>> I am not really sure about this one.
>>>>>> I get the reason you're adding it, but note that migrate_pages() can
>>>>>> also return
>>>>>> "fatal" errors and we don't propagate that.
>>>>>>
>>>>>> The moto has always been to migrate as much as possible, and this
>>>>>> changes this
>>>>>> behaviour.
>>>>> If we just skip to next pfn, offline_pages() will deadloop meaningless
>>>>> util received signal.
>>>>
>>>> Yeah, that's also not good,
>>>>
>>>>> It seems there is no document to guarantee memory offline have to
>>>>> migrate as much as possible.
>>>>
>>>> We should try offlining as good as possible. But if there is something
>>>> we just cannot possibly migrate, there is no sense in retrying.
>>>>
>>>> Now, could we run into this case here because we are racing with other
>>>> code, and actually retrying again could make it work?
>>>>
>>>> Remind me again: how exactly do we arrive at this point of having a
>>>> large folio that is hwpoisoned but still mapped?
>>>>
>>>> In memory_failure(), we do on a  large folio
>>>>
>>>> 1) folio_set_has_hwpoisoned
>>>> 2) try_to_split_thp_page
>>>> 3) if splitting fails, kill_procs_now
>>> If 2) is executed when do_migrate_range() increment the refcount of the
>>> folio, the split fails, and retry is meaningless.
>>
>> kill_procs_now will kill all processes, effectively unmapping the
>> folio in that case?
>>
>> So retrying would later just ... get us an unmapped folio and we can
>> make progress?
>>
> kill_procs_now()->collect_procs() collects the tasks to kill. But not
> all tasks that maps the folio
> will be collected,
> collect_procs_anon()->task_early_kill()->find_early_kill_thread() will not
> select the task (not current) if PF_MCE_PROCESS isn't set and
> sysctl_memory_failure_early_kill
> isn't enabled (this is the default behaviour).

I think you're right, that's rather nasty.

We fail to split, but keep the folio mapped into some processes.

And we can't unmap it because unmap_poisoned_folio() does not properly
support large folios yet.

We really should unmap the folio when splitting fails. :(

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-07 12:37               ` David Hildenbrand
@ 2025-07-08  1:15                 ` Jinjiang Tu
  2025-07-08  9:54                   ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Jinjiang Tu @ 2025-07-08  1:15 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang


On 2025/7/7 20:37, David Hildenbrand wrote:
> On 07.07.25 13:51, Jinjiang Tu wrote:
>>
>> 在 2025/7/3 17:06, David Hildenbrand 写道:
>>> On 03.07.25 10:24, Jinjiang Tu wrote:
>>>>
>>>> 在 2025/7/3 15:57, David Hildenbrand 写道:
>>>>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>>>>
>>>>>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>>>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio,
>>>>>>>> which
>>>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>>>
>>>>>>>> I can reproduce this issue in qemu after adding delay in
>>>>>>>> memory_failure()
>>>>>>>>
>>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>>>>      <TASK>
>>>>>>>>      rmap_walk_anon+0xda/0x1f0
>>>>>>>>      try_to_unmap+0x78/0x80
>>>>>>>>      ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>>>>      ? __pfx_folio_not_mapped+0x10/0x10
>>>>>>>>      ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>>>>      unmap_poisoned_folio+0x60/0x140
>>>>>>>>      do_migrate_range+0x4d1/0x600
>>>>>>>>      ? slab_memory_callback+0x6a/0x190
>>>>>>>>      ? notifier_call_chain+0x56/0xb0
>>>>>>>>      offline_pages+0x3e6/0x460
>>>>>>>>      memory_subsys_offline+0x130/0x1f0
>>>>>>>>      device_offline+0xba/0x110
>>>>>>>>      acpi_bus_offline+0xb7/0x130
>>>>>>>>      acpi_scan_hot_remove+0x77/0x290
>>>>>>>>      acpi_device_hotplug+0x1e0/0x240
>>>>>>>>      acpi_hotplug_work_fn+0x1a/0x30
>>>>>>>>      process_one_work+0x186/0x340
>>>>>>>>
>>>>>>>> In this case, just make offline_pages() fail.
>>>>>>>>
>>>>>>>> Besides, do_migrate_range() may be called between 
>>>>>>>> memory_failure set
>>>>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>>>>> In other
>>>>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>>>>> isolated, obey
>>>>>>>> it in do_migrate_range() too.
>>>>>>>>
>>>>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>>>>> pages to be offlined")
>>>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>>> ...
>>>>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>>>>> unsigned long nr_pages,
>>>>>>>>                     ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>>>>                  if (!ret) {
>>>>>>>> -                /*
>>>>>>>> -                 * TODO: fatal migration failures should bail
>>>>>>>> -                 * out
>>>>>>>> -                 */
>>>>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>>>>> +                if (ret)
>>>>>>>> +                    break;
>>>>>>> I am not really sure about this one.
>>>>>>> I get the reason you're adding it, but note that migrate_pages() 
>>>>>>> can
>>>>>>> also return
>>>>>>> "fatal" errors and we don't propagate that.
>>>>>>>
>>>>>>> The moto has always been to migrate as much as possible, and this
>>>>>>> changes this
>>>>>>> behaviour.
>>>>>> If we just skip to next pfn, offline_pages() will deadloop 
>>>>>> meaningless
>>>>>> util received signal.
>>>>>
>>>>> Yeah, that's also not good,
>>>>>
>>>>>> It seems there is no document to guarantee memory offline have to
>>>>>> migrate as much as possible.
>>>>>
>>>>> We should try offlining as good as possible. But if there is 
>>>>> something
>>>>> we just cannot possibly migrate, there is no sense in retrying.
>>>>>
>>>>> Now, could we run into this case here because we are racing with 
>>>>> other
>>>>> code, and actually retrying again could make it work?
>>>>>
>>>>> Remind me again: how exactly do we arrive at this point of having a
>>>>> large folio that is hwpoisoned but still mapped?
>>>>>
>>>>> In memory_failure(), we do on a  large folio
>>>>>
>>>>> 1) folio_set_has_hwpoisoned
>>>>> 2) try_to_split_thp_page
>>>>> 3) if splitting fails, kill_procs_now
>>>> If 2) is executed when do_migrate_range() increment the refcount of 
>>>> the
>>>> folio, the split fails, and retry is meaningless.
>>>
>>> kill_procs_now will kill all processes, effectively unmapping the
>>> folio in that case?
>>>
>>> So retrying would later just ... get us an unmapped folio and we can
>>> make progress?
>>>
>> kill_procs_now()->collect_procs() collects the tasks to kill. But not
>> all tasks that maps the folio
>> will be collected,
>> collect_procs_anon()->task_early_kill()->find_early_kill_thread() 
>> will not
>> select the task (not current) if PF_MCE_PROCESS isn't set and
>> sysctl_memory_failure_early_kill
>> isn't enabled (this is the default behaviour).
>
> I think you're right, that's rather nasty.
>
> We fail to split, but keep the folio mapped into some processes.
>
> And we can't unmap it because unmap_poisoned_folio() does not properly
> support large folios yet.
>
> We really should unmap the folio when splitting fail. :(
unmap_poisoned_folio() doesn't guarantee that the folio is unmapped
successfully, judging by its return value. Although I don't know in which
cases the unmap would fail.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-08  1:15                 ` Jinjiang Tu
@ 2025-07-08  9:54                   ` David Hildenbrand
  2025-07-09 16:27                     ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-08  9:54 UTC (permalink / raw)
  To: Jinjiang Tu, Oscar Salvador
  Cc: akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang, Zi Yan

On 08.07.25 03:15, Jinjiang Tu wrote:
> 
> 在 2025/7/7 20:37, David Hildenbrand 写道:
>> On 07.07.25 13:51, Jinjiang Tu wrote:
>>>
>>> 在 2025/7/3 17:06, David Hildenbrand 写道:
>>>> On 03.07.25 10:24, Jinjiang Tu wrote:
>>>>>
>>>>> 在 2025/7/3 15:57, David Hildenbrand 写道:
>>>>>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>>>>>
>>>>>>> 在 2025/7/1 22:21, Oscar Salvador 写道:
>>>>>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio,
>>>>>>>>> which
>>>>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>>>>
>>>>>>>>> I can reproduce this issue in qemu after adding delay in
>>>>>>>>> memory_failure()
>>>>>>>>>
>>>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>>>>>       <TASK>
>>>>>>>>>       rmap_walk_anon+0xda/0x1f0
>>>>>>>>>       try_to_unmap+0x78/0x80
>>>>>>>>>       ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>>>>>       ? __pfx_folio_not_mapped+0x10/0x10
>>>>>>>>>       ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>>>>>       unmap_poisoned_folio+0x60/0x140
>>>>>>>>>       do_migrate_range+0x4d1/0x600
>>>>>>>>>       ? slab_memory_callback+0x6a/0x190
>>>>>>>>>       ? notifier_call_chain+0x56/0xb0
>>>>>>>>>       offline_pages+0x3e6/0x460
>>>>>>>>>       memory_subsys_offline+0x130/0x1f0
>>>>>>>>>       device_offline+0xba/0x110
>>>>>>>>>       acpi_bus_offline+0xb7/0x130
>>>>>>>>>       acpi_scan_hot_remove+0x77/0x290
>>>>>>>>>       acpi_device_hotplug+0x1e0/0x240
>>>>>>>>>       acpi_hotplug_work_fn+0x1a/0x30
>>>>>>>>>       process_one_work+0x186/0x340
>>>>>>>>>
>>>>>>>>> In this case, just make offline_pages() fail.
>>>>>>>>>
>>>>>>>>> Besides, do_migrate_range() may be called between
>>>>>>>>> memory_failure set
>>>>>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>>>>>> In other
>>>>>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>>>>>> isolated, obey
>>>>>>>>> it in do_migrate_range() too.
>>>>>>>>>
>>>>>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>>>>>> pages to be offlined")
>>>>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>>>> ...
>>>>>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>>>>>> unsigned long nr_pages,
>>>>>>>>>                      ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>>>>>                   if (!ret) {
>>>>>>>>> -                /*
>>>>>>>>> -                 * TODO: fatal migration failures should bail
>>>>>>>>> -                 * out
>>>>>>>>> -                 */
>>>>>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>>>>>> +                if (ret)
>>>>>>>>> +                    break;
>>>>>>>> I am not really sure about this one.
>>>>>>>> I get the reason you're adding it, but note that migrate_pages()
>>>>>>>> can
>>>>>>>> also return
>>>>>>>> "fatal" errors and we don't propagate that.
>>>>>>>>
>>>>>>>> The moto has always been to migrate as much as possible, and this
>>>>>>>> changes this
>>>>>>>> behaviour.
>>>>>>> If we just skip to next pfn, offline_pages() will deadloop
>>>>>>> meaningless
>>>>>>> util received signal.
>>>>>>
>>>>>> Yeah, that's also not good,
>>>>>>
>>>>>>> It seems there is no document to guarantee memory offline have to
>>>>>>> migrate as much as possible.
>>>>>>
>>>>>> We should try offlining as good as possible. But if there is
>>>>>> something
>>>>>> we just cannot possibly migrate, there is no sense in retrying.
>>>>>>
>>>>>> Now, could we run into this case here because we are racing with
>>>>>> other
>>>>>> code, and actually retrying again could make it work?
>>>>>>
>>>>>> Remind me again: how exactly do we arrive at this point of having a
>>>>>> large folio that is hwpoisoned but still mapped?
>>>>>>
>>>>>> In memory_failure(), we do on a  large folio
>>>>>>
>>>>>> 1) folio_set_has_hwpoisoned
>>>>>> 2) try_to_split_thp_page
>>>>>> 3) if splitting fails, kill_procs_now
>>>>> If 2) is executed when do_migrate_range() increment the refcount of
>>>>> the
>>>>> folio, the split fails, and retry is meaningless.
>>>>
>>>> kill_procs_now will kill all processes, effectively unmapping the
>>>> folio in that case?
>>>>
>>>> So retrying would later just ... get us an unmapped folio and we can
>>>> make progress?
>>>>
>>> kill_procs_now()->collect_procs() collects the tasks to kill. But not
>>> all tasks that maps the folio
>>> will be collected,
>>> collect_procs_anon()->task_early_kill()->find_early_kill_thread()
>>> will not
>>> select the task (not current) if PF_MCE_PROCESS isn't set and
>>> sysctl_memory_failure_early_kill
>>> isn't enabled (this is the default behaviour).
>>
>> I think you're right, that's rather nasty.
>>
>> We fail to split, but keep the folio mapped into some processes.
>>
>> And we can't unmap it because unmap_poisoned_folio() does not properly
>> support large folios yet.
>>
>> We really should unmap the folio when splitting fail. :(
> unmap_poisoned_folio() doesn't guarantee the folio is unmapped
> successfully, according
> to the return val. Although I don't know in which case we will fail to
> unmap.

I think the only such cases are anon folios: when the folio is in the swapcache,
or it's lazyfree (!swapbacked), and we run into the walk_abort cases. Retrying will
likely make it work in many cases, I assume.

This is all really rather suboptimal and, I'm afraid, requires much bigger rework to improve it.

Failing memory offlining is also not nice.

I was wondering whether we can just keep splitting the folio until it works:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b1..991b095ac7e78 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1819,6 +1819,19 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
                         pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
  
                 if (folio_contain_hwpoisoned_page(folio)) {
+                       /*
+                        * unmap_poisoned_folio() cannot handle THPs, so
+                        * keep trying to split first.
+                        */
+                       if (folio_test_large(folio) && !folio_test_hugetlb(folio)) {
+                               folio_lock(folio);
+                               split_huge_page_to_list_to_order(&folio->page,
+                                                                NULL,
+                                                                min_order_for_split(folio));
+                               folio_unlock(folio);
+                               if (folio_test_large(folio))
+                                       goto put_folio;
+                       }
                         if (WARN_ON(folio_test_lru(folio)))
                                 folio_isolate_lru(folio);
                         if (folio_mapped(folio)) {


Probably the WARN_ON can indeed trigger now.


@Zi Yan, on a related note ...

in memory_failure(), we call try_to_split_thp_page(). If it works,
we assume that we have a small folio.

But that is not the case if split_huge_page() cannot split it to
order-0 ... min_order_for_split().

I'm afraid we have more such code that does not expect that we may still
have a large folio even when split_huge_page() succeeds ...
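
Just to make that pitfall concrete, a rough sketch of the re-check such callers
would need (the helper name is made up, and it assumes the caller holds the
folio lock as split_huge_page() requires):

/* Made-up helper: only report success if we really got an order-0 folio. */
static int split_to_order0_or_fail(struct page *page)
{
	struct folio *folio = page_folio(page);
	int ret;

	if (!folio_test_large(folio))
		return 0;

	/* May only get down to min_order_for_split() for pagecache folios. */
	ret = split_huge_page(page);
	if (ret)
		return ret;

	/* Re-derive the folio; the old one might have been split up. */
	folio = page_folio(page);
	return folio_test_large(folio) ? -EBUSY : 0;
}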

-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-08  9:54                   ` David Hildenbrand
@ 2025-07-09 16:27                     ` Zi Yan
  2025-07-14 13:53                       ` Pankaj Raghav
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2025-07-09 16:27 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox, Pankaj Raghav,
	Luis Chamberlain
  Cc: Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 8 Jul 2025, at 5:54, David Hildenbrand wrote:

> On 08.07.25 03:15, Jinjiang Tu wrote:
>>
>> On 2025/7/7 20:37, David Hildenbrand wrote:
>>> On 07.07.25 13:51, Jinjiang Tu wrote:
>>>>
>>>> On 2025/7/3 17:06, David Hildenbrand wrote:
>>>>> On 03.07.25 10:24, Jinjiang Tu wrote:
>>>>>>
>>>>>> On 2025/7/3 15:57, David Hildenbrand wrote:
>>>>>>> On 03.07.25 09:46, Jinjiang Tu wrote:
>>>>>>>>
>>>>>>>> On 2025/7/1 22:21, Oscar Salvador wrote:
>>>>>>>>> On Fri, Jun 27, 2025 at 08:57:47PM +0800, Jinjiang Tu wrote:
>>>>>>>>>> In do_migrate_range(), the hwpoisoned folio may be large folio,
>>>>>>>>>> which
>>>>>>>>>> can't be handled by unmap_poisoned_folio().
>>>>>>>>>>
>>>>>>>>>> I can reproduce this issue in qemu after adding delay in
>>>>>>>>>> memory_failure()
>>>>>>>>>>
>>>>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>>>>>> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>>>>>>>> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>>>>>>>>>>       <TASK>
>>>>>>>>>>       rmap_walk_anon+0xda/0x1f0
>>>>>>>>>>       try_to_unmap+0x78/0x80
>>>>>>>>>>       ? __pfx_try_to_unmap_one+0x10/0x10
>>>>>>>>>>       ? __pfx_folio_not_mapped+0x10/0x10
>>>>>>>>>>       ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>>>>>>>>>>       unmap_poisoned_folio+0x60/0x140
>>>>>>>>>>       do_migrate_range+0x4d1/0x600
>>>>>>>>>>       ? slab_memory_callback+0x6a/0x190
>>>>>>>>>>       ? notifier_call_chain+0x56/0xb0
>>>>>>>>>>       offline_pages+0x3e6/0x460
>>>>>>>>>>       memory_subsys_offline+0x130/0x1f0
>>>>>>>>>>       device_offline+0xba/0x110
>>>>>>>>>>       acpi_bus_offline+0xb7/0x130
>>>>>>>>>>       acpi_scan_hot_remove+0x77/0x290
>>>>>>>>>>       acpi_device_hotplug+0x1e0/0x240
>>>>>>>>>>       acpi_hotplug_work_fn+0x1a/0x30
>>>>>>>>>>       process_one_work+0x186/0x340
>>>>>>>>>>
>>>>>>>>>> In this case, just make offline_pages() fail.
>>>>>>>>>>
>>>>>>>>>> Besides, do_migrate_range() may be called between
>>>>>>>>>> memory_failure set
>>>>>>>>>> hwposion flag and ioslate the folio from lru, so remove WARN_ON().
>>>>>>>>>> In other
>>>>>>>>>> places, unmap_poisoned_folio() is called when the folio is
>>>>>>>>>> isolated, obey
>>>>>>>>>> it in do_migrate_range() too.
>>>>>>>>>>
>>>>>>>>>> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned
>>>>>>>>>> pages to be offlined")
>>>>>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>>>>> ...
>>>>>>>>>> @@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pfn,
>>>>>>>>>> unsigned long nr_pages,
>>>>>>>>>>                      ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>>>>>>>>>                   if (!ret) {
>>>>>>>>>> -                /*
>>>>>>>>>> -                 * TODO: fatal migration failures should bail
>>>>>>>>>> -                 * out
>>>>>>>>>> -                 */
>>>>>>>>>> -                do_migrate_range(pfn, end_pfn);
>>>>>>>>>> +                ret = do_migrate_range(pfn, end_pfn);
>>>>>>>>>> +                if (ret)
>>>>>>>>>> +                    break;
>>>>>>>>> I am not really sure about this one.
>>>>>>>>> I get the reason you're adding it, but note that migrate_pages()
>>>>>>>>> can
>>>>>>>>> also return
>>>>>>>>> "fatal" errors and we don't propagate that.
>>>>>>>>>
>>>>>>>>> The moto has always been to migrate as much as possible, and this
>>>>>>>>> changes this
>>>>>>>>> behaviour.
>>>>>>>> If we just skip to next pfn, offline_pages() will deadloop
>>>>>>>> meaningless
>>>>>>>> util received signal.
>>>>>>>
>>>>>>> Yeah, that's also not good,
>>>>>>>
>>>>>>>> It seems there is no document to guarantee memory offline have to
>>>>>>>> migrate as much as possible.
>>>>>>>
>>>>>>> We should try offlining as good as possible. But if there is
>>>>>>> something
>>>>>>> we just cannot possibly migrate, there is no sense in retrying.
>>>>>>>
>>>>>>> Now, could we run into this case here because we are racing with
>>>>>>> other
>>>>>>> code, and actually retrying again could make it work?
>>>>>>>
>>>>>>> Remind me again: how exactly do we arrive at this point of having a
>>>>>>> large folio that is hwpoisoned but still mapped?
>>>>>>>
>>>>>>> In memory_failure(), we do on a  large folio
>>>>>>>
>>>>>>> 1) folio_set_has_hwpoisoned
>>>>>>> 2) try_to_split_thp_page
>>>>>>> 3) if splitting fails, kill_procs_now
>>>>>> If 2) is executed when do_migrate_range() increment the refcount of
>>>>>> the
>>>>>> folio, the split fails, and retry is meaningless.
>>>>>
>>>>> kill_procs_now will kill all processes, effectively unmapping the
>>>>> folio in that case?
>>>>>
>>>>> So retrying would later just ... get us an unmapped folio and we can
>>>>> make progress?
>>>>>
>>>> kill_procs_now()->collect_procs() collects the tasks to kill. But not
>>>> all tasks that maps the folio
>>>> will be collected,
>>>> collect_procs_anon()->task_early_kill()->find_early_kill_thread()
>>>> will not
>>>> select the task (not current) if PF_MCE_PROCESS isn't set and
>>>> sysctl_memory_failure_early_kill
>>>> isn't enabled (this is the default behaviour).
>>>
>>> I think you're right, that's rather nasty.
>>>
>>> We fail to split, but keep the folio mapped into some processes.
>>>
>>> And we can't unmap it because unmap_poisoned_folio() does not properly
>>> support large folios yet.
>>>
>>> We really should unmap the folio when splitting fail. :(
>> unmap_poisoned_folio() doesn't guarantee the folio is unmapped
>> successfully, according
>> to the return val. Although I don't know in which case we will fail to
>> unmap.
>
> I think there are only cases for anon folios, when the folio is in the swapcache
> or it's lazyfree (!swapbacked) and we run into the walk_abort cases. Retrying will
> likely make it work in many cases I assume.
>
> This is all really rather suboptimal and, I'm afraid, requires much bigger rework to improve it.
>
> Failing memory offlining is also not nice.
>
> I was wondering whether we can just keep splitting the folio until it works:
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b1caedbade5b1..991b095ac7e78 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1819,6 +1819,19 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>                  if (folio_contain_hwpoisoned_page(folio)) {
> +                       /*
> +                        * unmap_poisoned_folio() cannot handle THPs, so
> +                        * keep trying to split first.
> +                        */
> +                       if (folio_test_large(folio) && !folio_test_hugetlb(folio)) {
> +                               folio_lock(folio);
> +                               split_huge_page_to_list_to_order(&folio->page,
> +                                                                NULL,
> +                                                                min_order_for_split(folio));
> +                               folio_unlock(folio);
> +                               if (folio_test_large(folio))
> +                                       goto put_folio;
> +                       }
>                         if (WARN_ON(folio_test_lru(folio)))
>                                 folio_isolate_lru(folio);
>                         if (folio_mapped(folio)) {
>
>
> Probably the WARN_ON can indeed trigger now.
>
>
> @Zi Yan, on a related note ...
>
> in memory_failure(), we call try_to_split_thp_page(). If it works,
> we assume that we have a small folio.
>
> But that is not the case if split_huge_page() cannot split it to
> order-0 ... min_order_for_split().

Right. memory failure needs to learn about this. Either poison every
subpage or write back if necessary and drop the page cache folio.
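
Purely as an illustration of the first option (an untested sketch), poisoning
each subpage of a folio that could not be split:

	long i;

	/* Illustrative only: mark every page of the unsplittable folio. */
	for (i = 0; i < folio_nr_pages(folio); i++)
		SetPageHWPoison(folio_page(folio, i));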

>
> I'm afraid we havbe more such code that does not expect that if split_huge_page()
> succeeds that we still have a large folio ...

I did some searching; here are the users of split_huge_page*():

1. ksm: since it is anonymous only, so split always goes to order-0;
2. userfaultfd: it is also anonymous;
3. madvise cold or pageout: a large pagecache folio will be split if it is partially
   mapped. And it will retry. It might cause a deadlock if the folio has a min order.
4. shmem: split always goes to order-0;
5. memory-failure: see above.

So we will need to take care of the madvise cold or pageout case?

Hi Matthew, Pankaj, and Luis,

Is it possible to partially map a min-order folio in a fs with LBS? Based on my
understanding of madvise_cold_or_pageout_pte_range(), it seems that it will try
to split the folio and expects an order-0 folio after a successful split.
But splitting a min-order folio is a nop, which could lead to a deadlock in that code.
Or did I just get it wrong?


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-09 16:27                     ` Zi Yan
@ 2025-07-14 13:53                       ` Pankaj Raghav
  2025-07-14 14:20                         ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: Pankaj Raghav @ 2025-07-14 13:53 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, Matthew Wilcox, Luis Chamberlain
  Cc: Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

Hi Zi Yan,

> >
> > Probably the WARN_ON can indeed trigger now.
> >
> >
> > @Zi Yan, on a related note ...
> >
> > in memory_failure(), we call try_to_split_thp_page(). If it works,
> > we assume that we have a small folio.
> >
> > But that is not the case if split_huge_page() cannot split it to
> > order-0 ... min_order_for_split().
>
> Right. memory failure needs to learn about this. Either poison every
> subpage or write back if necessary and drop the page cache folio.
>
> >
> > I'm afraid we havbe more such code that does not expect that if split_huge_page()
> > succeeds that we still have a large folio ...
>
> I did some search, here are the users of split_huge_page*():
>
> 1. ksm: since it is anonymous only, so split always goes to order-0;
> 2. userfaultfd: it is also anonymous;
> 3. madvise cold or pageout: a large pagecache folio will be split if it is partially
>    mapped. And it will retry. It might cause a deadlock if the folio has a min order.
> 4. shmem: split always goes to order-0;
> 5. memory-failure: see above.
>
> So we will need to take care of madvise cold or pageout case?
>
> Hi Matthew, Pankaj, and Luis,
>
> Is it possible to partially map a min-order folio in a fs with LBS? Based on my

Typically, FSs match the min order with the blocksize of the filesystem.
As filesystem blocks are the smallest unit of data that the filesystem uses
to store file data on disk, we cannot partially map them.

So if I understand your question correctly, the answer is no.

> understanding of  madvise_cold_or_pageout_pte_range(), it seems that it will try
> to split the folio and expects a order-0 folio after a successful split.
> But splitting a min-order folio is a nop. It could lead to a deadlock in the code.
> Or I just get it wrong?

Yeah, we have checks to make sure we never split a folio < min-order.
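
Roughly, as an illustration only (the file pointer here is hypothetical):

	/*
	 * Illustration only: with a 16K block size on 4K pages, the mapping's
	 * minimum folio order is 2, so the page cache never holds folios
	 * smaller than 16K for that file, and a split never goes below that.
	 */
	unsigned int min_order = mapping_min_folio_order(file->f_mapping);
	size_t min_bytes = PAGE_SIZE << min_order;	/* typically == fs block size */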

I hope it answers your question :)

--
Pankaj


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 13:53                       ` Pankaj Raghav
@ 2025-07-14 14:20                         ` Zi Yan
  2025-07-14 14:24                           ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2025-07-14 14:20 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: David Hildenbrand, Matthew Wilcox, Luis Chamberlain, Jinjiang Tu,
	Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14 Jul 2025, at 9:53, Pankaj Raghav wrote:

> Hi Zi Yan,
>
>>>
>>> Probably the WARN_ON can indeed trigger now.
>>>
>>>
>>> @Zi Yan, on a related note ...
>>>
>>> in memory_failure(), we call try_to_split_thp_page(). If it works,
>>> we assume that we have a small folio.
>>>
>>> But that is not the case if split_huge_page() cannot split it to
>>> order-0 ... min_order_for_split().
>>
>> Right. memory failure needs to learn about this. Either poison every
>> subpage or write back if necessary and drop the page cache folio.
>>
>>>
>>> I'm afraid we havbe more such code that does not expect that if split_huge_page()
>>> succeeds that we still have a large folio ...
>>
>> I did some search, here are the users of split_huge_page*():
>>
>> 1. ksm: since it is anonymous only, so split always goes to order-0;
>> 2. userfaultfd: it is also anonymous;
>> 3. madvise cold or pageout: a large pagecache folio will be split if it is partially
>>    mapped. And it will retry. It might cause a deadlock if the folio has a min order.
>> 4. shmem: split always goes to order-0;
>> 5. memory-failure: see above.
>>
>> So we will need to take care of madvise cold or pageout case?
>>
>> Hi Matthew, Pankaj, and Luis,
>>
>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>
> Typically, FSs match the min order with the blocksize of the filesystem.
> As a filesystem block is the smallest unit of data that the filesystem uses
> to store file data on the disk, we cannot partially map them.
>
> So if I understand your question correctly, the answer is no.
>
>> understanding of  madvise_cold_or_pageout_pte_range(), it seems that it will try
>> to split the folio and expects a order-0 folio after a successful split.
>> But splitting a min-order folio is a nop. It could lead to a deadlock in the code.
>> Or I just get it wrong?
>
> Yeah, we have checks to make sure we never split a folio < min-order.
>
> I hope it answers your question :)

Thank you for the explanation. This means the madvise cold or pageout case is good.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 14:20                         ` Zi Yan
@ 2025-07-14 14:24                           ` David Hildenbrand
  2025-07-14 15:09                             ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:24 UTC (permalink / raw)
  To: Zi Yan, Pankaj Raghav
  Cc: Matthew Wilcox, Luis Chamberlain, Jinjiang Tu, Oscar Salvador,
	akpm, linmiaohe, mhocko, linux-mm, wangkefeng.wang

On 14.07.25 16:20, Zi Yan wrote:
> On 14 Jul 2025, at 9:53, Pankaj Raghav wrote:
> 
>> Hi Zi Yan,
>>
>>>>
>>>> Probably the WARN_ON can indeed trigger now.
>>>>
>>>>
>>>> @Zi Yan, on a related note ...
>>>>
>>>> in memory_failure(), we call try_to_split_thp_page(). If it works,
>>>> we assume that we have a small folio.
>>>>
>>>> But that is not the case if split_huge_page() cannot split it to
>>>> order-0 ... min_order_for_split().
>>>
>>> Right. memory failure needs to learn about this. Either poison every
>>> subpage or write back if necessary and drop the page cache folio.
>>>
>>>>
>>>> I'm afraid we havbe more such code that does not expect that if split_huge_page()
>>>> succeeds that we still have a large folio ...
>>>
>>> I did some search, here are the users of split_huge_page*():
>>>
>>> 1. ksm: since it is anonymous only, so split always goes to order-0;
>>> 2. userfaultfd: it is also anonymous;
>>> 3. madvise cold or pageout: a large pagecache folio will be split if it is partially
>>>     mapped. And it will retry. It might cause a deadlock if the folio has a min order.
>>> 4. shmem: split always goes to order-0;
>>> 5. memory-failure: see above.
>>>
>>> So we will need to take care of madvise cold or pageout case?
>>>
>>> Hi Matthew, Pankaj, and Luis,
>>>
>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>
>> Typically, FSs match the min order with the blocksize of the filesystem.
>> As a filesystem block is the smallest unit of data that the filesystem uses
>> to store file data on the disk, we cannot partially map them.
>>
>> So if I understand your question correctly, the answer is no.

I'm confused. Shouldn't this be trivially possible?

E.g., just mmap() a single page of such a file? Who would make that fail?

And if it doesn't fail, who would stop us from munmap()'ing everything 
except a single page?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 14:24                           ` David Hildenbrand
@ 2025-07-14 15:09                             ` Pankaj Raghav (Samsung)
  2025-07-14 15:14                               ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-14 15:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Matthew Wilcox, Luis Chamberlain, Jinjiang Tu,
	Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

> > > > So we will need to take care of madvise cold or pageout case?
> > > > 
> > > > Hi Matthew, Pankaj, and Luis,
> > > > 
> > > > Is it possible to partially map a min-order folio in a fs with LBS? Based on my
> > > 
> > > Typically, FSs match the min order with the blocksize of the filesystem.
> > > As a filesystem block is the smallest unit of data that the filesystem uses
> > > to store file data on the disk, we cannot partially map them.
> > > 
> > > So if I understand your question correctly, the answer is no.
> 
> I'm confused. Shouldn't this be trivially possible?
> 
Hmm, maybe I misunderstood the question?

> E.g., just mmap() a single page of such a file? Who would make that fail?
> 

My point was, even if you try to mmap a single page of a file, the page
cache will read in the whole block (which corresponds to a min-order folio).

Technically we can mmap a single page of a file, but the FS will always read
and write **at least** in min-folio-order chunks.
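
E.g., a tiny userspace sketch (the path is hypothetical; assume the file lives
on a 16K-block-size filesystem with 4K pages). The process maps and touches a
single page, while the page cache backs it with a whole 16K folio, which is
therefore only partially mapped into the page tables:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/mnt/lbs/file", O_RDONLY);	/* hypothetical LBS mount */
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Faulting one byte pulls the whole min-order folio into the page cache. */
	printf("first byte: %d\n", p[0]);
	munmap(p, psz);
	close(fd);
	return 0;
}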

> And if it doesn't fail, who would stop us from munmap()'ing everything
> except a single page.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

--
Pankaj


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:09                             ` Pankaj Raghav (Samsung)
@ 2025-07-14 15:14                               ` David Hildenbrand
  2025-07-14 15:25                                 ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:14 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Zi Yan, Matthew Wilcox, Luis Chamberlain, Jinjiang Tu,
	Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>
>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>
>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>
>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>> to store file data on the disk, we cannot partially map them.
>>>>
>>>> So if I understand your question correctly, the answer is no.
>>
>> I'm confused. Shouldn't this be trivially possible?
>>
> Hmm, maybe I misunderstood the question?
> 
>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>
> 
> My point was, even if you try to mmap a single page of a file, page
> cache will read the whole block (that corresponds to min order folio).
> 
> Technically we can mmap a single page of file, but FS will always read
> and write **at least** in min folio order chunks.

Okay, so it can be partially mapped into page tables :) What happens in 
the background (page cache management) is a different story

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:14                               ` David Hildenbrand
@ 2025-07-14 15:25                                 ` Zi Yan
  2025-07-14 15:28                                   ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2025-07-14 15:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14 Jul 2025, at 11:14, David Hildenbrand wrote:

> On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>>
>>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>>
>>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>>
>>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>>> to store file data on the disk, we cannot partially map them.
>>>>>
>>>>> So if I understand your question correctly, the answer is no.
>>>
>>> I'm confused. Shouldn't this be trivially possible?
>>>
>> Hmm, maybe I misunderstood the question?
>>
>>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>>
>>
>> My point was, even if you try to mmap a single page of a file, page
>> cache will read the whole block (that corresponds to min order folio).
>>
>> Technically we can mmap a single page of file, but FS will always read
>> and write **at least** in min folio order chunks.
>
> Okay, so it can be partially mapped into page tables :) What happens in the background (page cache management) is a different story

David, thanks for getting to the bottom of this.

OK. So we will see deadlock looping in the madvise cold or pageout case.
I wonder how to proceed with this. Since the folio is seen as a whole
by the fs, it should be marked cold/paged out as a whole. Maybe we should
skip the partially mapped region?


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:25                                 ` Zi Yan
@ 2025-07-14 15:28                                   ` Zi Yan
  2025-07-14 15:33                                     ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2025-07-14 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14 Jul 2025, at 11:25, Zi Yan wrote:

> On 14 Jul 2025, at 11:14, David Hildenbrand wrote:
>
>> On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>>>
>>>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>>>
>>>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>>>
>>>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>>>> to store file data on the disk, we cannot partially map them.
>>>>>>
>>>>>> So if I understand your question correctly, the answer is no.
>>>>
>>>> I'm confused. Shouldn't this be trivially possible?
>>>>
>>> Hmm, maybe I misunderstood the question?
>>>
>>>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>>>
>>>
>>> My point was, even if you try to mmap a single page of a file, page
>>> cache will read the whole block (that corresponds to min order folio).
>>>
>>> Technically we can mmap a single page of file, but FS will always read
>>> and write **at least** in min folio order chunks.
>>
>> Okay, so it can be partially mapped into page tables :) What happens in the background (page cache management) is a different story
>
> David, thanks for getting to the bottom of this.
>
> OK. So we will see deadlock looping in madvise cold or pageout case.
> I wonder how to proceed with this. Since the folio is seen as a whole
> by fs, it should be marked cold/paged out as a whole. Maybe we should
> skip the partially mapped region?

Actually, it is skipped, since split_folio() bumps new_order to the min
order, and if the folio order is already at the min order, the split code
returns -EINVAL. This makes the madvise cold or pageout code move to the
next address.
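
In other words, a simplified sketch (not the actual mm/huge_memory.c code;
folio and new_order as in the split path):

	unsigned int min_order = 0;

	if (!folio_test_anon(folio) && folio->mapping)
		min_order = mapping_min_folio_order(folio->mapping);
	if (new_order < min_order)
		new_order = min_order;		/* split_folio() bumps the order */
	if (folio_order(folio) <= new_order)
		return -EINVAL;			/* already at min order */
	return split_huge_page_to_list_to_order(&folio->page, NULL, new_order);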

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:28                                   ` Zi Yan
@ 2025-07-14 15:33                                     ` David Hildenbrand
  2025-07-14 15:44                                       ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14.07.25 17:28, Zi Yan wrote:
> On 14 Jul 2025, at 11:25, Zi Yan wrote:
> 
>> On 14 Jul 2025, at 11:14, David Hildenbrand wrote:
>>
>>> On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>>>>
>>>>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>>>>
>>>>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>>>>
>>>>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>>>>> to store file data on the disk, we cannot partially map them.
>>>>>>>
>>>>>>> So if I understand your question correctly, the answer is no.
>>>>>
>>>>> I'm confused. Shouldn't this be trivially possible?
>>>>>
>>>> Hmm, maybe I misunderstood the question?
>>>>
>>>>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>>>>
>>>>
>>>> My point was, even if you try to mmap a single page of a file, page
>>>> cache will read the whole block (that corresponds to min order folio).
>>>>
>>>> Technically we can mmap a single page of file, but FS will always read
>>>> and write **at least** in min folio order chunks.
>>>
>>> Okay, so it can be partially mapped into page tables :) What happens in the background (page cache management) is a different story
>>
>> David, thanks for getting to the bottom of this.
>>
>> OK. So we will see deadlock looping in madvise cold or pageout case.
>> I wonder how to proceed with this. Since the folio is seen as a whole
>> by fs, it should be marked cold/paged out as a whole. Maybe we should
>> skip the partially mapped region?
> 
> Actually, it is skipped, since split_folio() bumps new_order to the min
> order, and if the folio order is already at min order, split code return
> -EINVAL. This makes the madvise cold or pageout code move to the next
> address.

But what if the folio order is 2x min_order etc?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:33                                     ` David Hildenbrand
@ 2025-07-14 15:44                                       ` Zi Yan
  2025-07-14 15:52                                         ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2025-07-14 15:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14 Jul 2025, at 11:33, David Hildenbrand wrote:

> On 14.07.25 17:28, Zi Yan wrote:
>> On 14 Jul 2025, at 11:25, Zi Yan wrote:
>>
>>> On 14 Jul 2025, at 11:14, David Hildenbrand wrote:
>>>
>>>> On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>>>>>
>>>>>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>>>>>
>>>>>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>>>>>
>>>>>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>>>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>>>>>> to store file data on the disk, we cannot partially map them.
>>>>>>>>
>>>>>>>> So if I understand your question correctly, the answer is no.
>>>>>>
>>>>>> I'm confused. Shouldn't this be trivially possible?
>>>>>>
>>>>> Hmm, maybe I misunderstood the question?
>>>>>
>>>>>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>>>>>
>>>>>
>>>>> My point was, even if you try to mmap a single page of a file, page
>>>>> cache will read the whole block (that corresponds to min order folio).
>>>>>
>>>>> Technically we can mmap a single page of file, but FS will always read
>>>>> and write **at least** in min folio order chunks.
>>>>
>>>> Okay, so it can be partially mapped into page tables :) What happens in the background (page cache management) is a different story
>>>
>>> David, thanks for getting to the bottom of this.
>>>
>>> OK. So we will see deadlock looping in madvise cold or pageout case.
>>> I wonder how to proceed with this. Since the folio is seen as a whole
>>> by fs, it should be marked cold/paged out as a whole. Maybe we should
>>> skip the partially mapped region?
>>
>> Actually, it is skipped, since split_folio() bumps new_order to the min
>> order, and if the folio order is already at min order, split code return
>> -EINVAL. This makes the madvise cold or pageout code move to the next
>> address.
>
> But what if the folio order is 2x min_order etc?

If the folio order is greater than min_order, the code is suboptimal.
It will first split the original folio to min_order, then
loop through the after-split folios. If the after-split ones are fully
mapped, the madvise ops will be performed. Otherwise, the code will
try to split the after-split ones again, fail, and end up skipping
the partially mapped range.

An improvement can be done by skipping the folio if it is partially
mapped and its order is already min_order. Something like:

diff --git a/mm/madvise.c b/mm/madvise.c
index e61e32b2cd91..545ab7920a81 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -499,6 +499,13 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (nr < folio_nr_pages(folio)) {
 				int err;

+				/*
+				 * Skip partially mapped folios that are
+				 * already at their min order
+				 */
+				if (folio_order(folio) ==
+				    min_order_for_split(folio))
+					continue;
 				if (folio_maybe_mapped_shared(folio))
 					continue;
 				if (pageout_anon_only_filter && !folio_test_anon(folio))




Best Regards,
Yan, Zi


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:44                                       ` Zi Yan
@ 2025-07-14 15:52                                         ` David Hildenbrand
  2025-07-20  2:23                                           ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:52 UTC (permalink / raw)
  To: Zi Yan
  Cc: Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, akpm, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 14.07.25 17:44, Zi Yan wrote:
> On 14 Jul 2025, at 11:33, David Hildenbrand wrote:
> 
>> On 14.07.25 17:28, Zi Yan wrote:
>>> On 14 Jul 2025, at 11:25, Zi Yan wrote:
>>>
>>>> On 14 Jul 2025, at 11:14, David Hildenbrand wrote:
>>>>
>>>>> On 14.07.25 17:09, Pankaj Raghav (Samsung) wrote:
>>>>>>>>>> So we will need to take care of madvise cold or pageout case?
>>>>>>>>>>
>>>>>>>>>> Hi Matthew, Pankaj, and Luis,
>>>>>>>>>>
>>>>>>>>>> Is it possible to partially map a min-order folio in a fs with LBS? Based on my
>>>>>>>>>
>>>>>>>>> Typically, FSs match the min order with the blocksize of the filesystem.
>>>>>>>>> As a filesystem block is the smallest unit of data that the filesystem uses
>>>>>>>>> to store file data on the disk, we cannot partially map them.
>>>>>>>>>
>>>>>>>>> So if I understand your question correctly, the answer is no.
>>>>>>>
>>>>>>> I'm confused. Shouldn't this be trivially possible?
>>>>>>>
>>>>>> Hmm, maybe I misunderstood the question?
>>>>>>
>>>>>>> E.g., just mmap() a single page of such a file? Who would make that fail?
>>>>>>>
>>>>>>
>>>>>> My point was, even if you try to mmap a single page of a file, page
>>>>>> cache will read the whole block (that corresponds to min order folio).
>>>>>>
>>>>>> Technically we can mmap a single page of file, but FS will always read
>>>>>> and write **at least** in min folio order chunks.
>>>>>
>>>>> Okay, so it can be partially mapped into page tables :) What happens in the background (page cache management) is a different story
>>>>
>>>> David, thanks for getting to the bottom of this.
>>>>
>>>> OK. So we will see deadlock looping in madvise cold or pageout case.
>>>> I wonder how to proceed with this. Since the folio is seen as a whole
>>>> by fs, it should be marked cold/paged out as a whole. Maybe we should
>>>> skip the partially mapped region?
>>>
>>> Actually, it is skipped, since split_folio() bumps new_order to the min
>>> order, and if the folio order is already at min order, split code return
>>> -EINVAL. This makes the madvise cold or pageout code move to the next
>>> address.
>>
>> But what if the folio order is 2x min_order etc?
> 
> If folio order is greater than min_order, the code is suboptimal.
> It will first split the original folio to min_order, then
> loop through after-split folios. If the after-split ones are fully
> mapped, madvise ops will be performed. Otherwise, the code will
> try to split after-split ones again, fail, and end up with skipping
> the partially mapped range.

Indeed, thanks for checking, madvise() is fine then!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-14 15:52                                         ` David Hildenbrand
@ 2025-07-20  2:23                                           ` Andrew Morton
  2025-07-22 15:30                                             ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2025-07-20  2:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang


I continue to retain the original patch in mm-hotfixes as part of
akpm's lame bug-tracking system.  3 weeks in -next.

And I just added a cc:stable to it because December 2018.

I don't expect many real-world users will be putting fake delays in
memory_failure(), but it's there.

So what do we do here?  Add a TODO, merge it under the
better-than-it-was-before theory and move on?


From: Jinjiang Tu <tujinjiang@huawei.com>
Subject: mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
Date: Fri, 27 Jun 2025 20:57:47 +0800

In do_migrate_range(), the hwpoisoned folio may be large folio, which
can't be handled by unmap_poisoned_folio().

I can reproduce this issue in qemu after adding delay in memory_failure()

BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:try_to_unmap_one+0x16a/0xfc0
 <TASK>
 rmap_walk_anon+0xda/0x1f0
 try_to_unmap+0x78/0x80
 ? __pfx_try_to_unmap_one+0x10/0x10
 ? __pfx_folio_not_mapped+0x10/0x10
 ? __pfx_folio_lock_anon_vma_read+0x10/0x10
 unmap_poisoned_folio+0x60/0x140
 do_migrate_range+0x4d1/0x600
 ? slab_memory_callback+0x6a/0x190
 ? notifier_call_chain+0x56/0xb0
 offline_pages+0x3e6/0x460
 memory_subsys_offline+0x130/0x1f0
 device_offline+0xba/0x110
 acpi_bus_offline+0xb7/0x130
 acpi_scan_hot_remove+0x77/0x290
 acpi_device_hotplug+0x1e0/0x240
 acpi_hotplug_work_fn+0x1a/0x30
 process_one_work+0x186/0x340

In this case, just make offline_pages() fail.

Also, do_migrate_range() may be called between memory_failure() setting
the hwposion flag and isolation of the folio from the lru, so remove
WARN_ON().

Also, in other places unmap_poisoned_folio() is called when the folio is
isolated, so obey that in do_migrate_range().

Link: https://lkml.kernel.org/r/20250627125747.3094074-3-tujinjiang@huawei.com
Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-fix-hwpoisoned-large-folio-handling-in-do_migrate_range
+++ a/mm/memory_hotplug.c
@@ -1795,7 +1795,7 @@ found:
 	return 0;
 }
 
-static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct folio *folio;
 	unsigned long pfn;
@@ -1819,8 +1819,10 @@ static void do_migrate_range(unsigned lo
 			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
 
 		if (folio_contain_hwpoisoned_page(folio)) {
-			if (WARN_ON(folio_test_lru(folio)))
-				folio_isolate_lru(folio);
+			if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+				goto err_out;
+			if (folio_test_lru(folio) && !folio_isolate_lru(folio))
+				goto err_out;
 			if (folio_mapped(folio)) {
 				folio_lock(folio);
 				unmap_poisoned_folio(folio, pfn, false);
@@ -1877,6 +1879,11 @@ put_folio:
 			putback_movable_pages(&source);
 		}
 	}
+	return 0;
+err_out:
+	folio_put(folio);
+	putback_movable_pages(&source);
+	return -EBUSY;
 }
 
 static int __init cmdline_parse_movable_node(char *p)
@@ -2041,11 +2048,9 @@ int offline_pages(unsigned long start_pf
 
 			ret = scan_movable_pages(pfn, end_pfn, &pfn);
 			if (!ret) {
-				/*
-				 * TODO: fatal migration failures should bail
-				 * out
-				 */
-				do_migrate_range(pfn, end_pfn);
+				ret = do_migrate_range(pfn, end_pfn);
+				if (ret)
+					break;
 			}
 		} while (!ret);
 
_



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-20  2:23                                           ` Andrew Morton
@ 2025-07-22 15:30                                             ` David Hildenbrand
  2025-08-21  5:02                                               ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-07-22 15:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zi Yan, Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 20.07.25 04:23, Andrew Morton wrote:
> 
> I continue to retain the original patch in mm-hotfixes as part of
> akpm's lame bug-tracking system.  3 weeks in -next.
> 
> And I just added a cc:stable to it because December 2018.
> 
> I don't expect many real-world users will be putting fake delays in
> memory_failure(), but it's there.
> 
> So what do we do here?  Add a TODO, merge it under the
> better-than-it-was-before theory and move on?

I would feel better if we could just not fail memory offlining. Memory 
offlining is documented to loop forever if something bad happens, and 
user space can cancel it.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-07-22 15:30                                             ` David Hildenbrand
@ 2025-08-21  5:02                                               ` Andrew Morton
  2025-08-21 22:07                                                 ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2025-08-21  5:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On Tue, 22 Jul 2025 17:30:06 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 20.07.25 04:23, Andrew Morton wrote:
> > 
> > I continue to retain the original patch in mm-hotfixes as part of
> > akpm's lame bug-tracking system.  3 weeks in -next.
> > 
> > And I just added a cc:stable to it because December 2018.
> > 
> > I don't expect many real-world users will be putting fake delays in
> > memory_failure(), but it's there.
> > 
> > So what do we do here?  Add a TODO, merge it under the
> > better-than-it-was-before theory and move on?
> 
> I would feel better if we could just not fail memory offlining. Memory 
> offlining is documented to loop forever if something bad happens, and 
> user space can cancel it.
> 

Pathetic monthly prod to keep this on people's radar.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-08-21  5:02                                               ` Andrew Morton
@ 2025-08-21 22:07                                                 ` David Hildenbrand
  2025-08-22 17:24                                                   ` Zi Yan
  2025-08-25  2:05                                                   ` Miaohe Lin
  0 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-08-21 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zi Yan, Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, linmiaohe, mhocko, linux-mm,
	wangkefeng.wang

On 21.08.25 07:02, Andrew Morton wrote:
> On Tue, 22 Jul 2025 17:30:06 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> On 20.07.25 04:23, Andrew Morton wrote:
>>>
>>> I continue to retain the original patch in mm-hotfixes as part of
>>> akpm's lame bug-tracking system.  3 weeks in -next.
>>>
>>> And I just added a cc:stable to it because December 2018.
>>>
>>> I don't expect many real-world users will be putting fake delays in
>>> memory_failure(), but it's there.
>>>
>>> So what do we do here?  Add a TODO, merge it under the
>>> better-than-it-was-before theory and move on?
>>
>> I would feel better if we could just not fail memory offlining. Memory
>> offlining is documented to loop forever if something bad happens, and
>> user space can cancel it.
>>
> 
> Pathetic monthly prod to keep this on people's radar.
> 

Let's do something minimal for now:

 From 403b2a375a10c17fd6e2aeffbe0fdaf623faa621 Mon Sep 17 00:00:00 2001
From: Jinjiang Tu <tujinjiang@huawei.com>
Date: Fri, 27 Jun 2025 20:57:47 +0800
Subject: [PATCH] mm/memory_hotplug: fix hwpoisoned large folio handling in
  do_migrate_range

In do_migrate_range(), the hwpoisoned folio may be large folio, which
can't be handled by unmap_poisoned_folio().

I can reproduce this issue in qemu after adding delay in memory_failure()

BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:try_to_unmap_one+0x16a/0xfc0
  <TASK>
  rmap_walk_anon+0xda/0x1f0
  try_to_unmap+0x78/0x80
  ? __pfx_try_to_unmap_one+0x10/0x10
  ? __pfx_folio_not_mapped+0x10/0x10
  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
  unmap_poisoned_folio+0x60/0x140
  do_migrate_range+0x4d1/0x600
  ? slab_memory_callback+0x6a/0x190
  ? notifier_call_chain+0x56/0xb0
  offline_pages+0x3e6/0x460
  memory_subsys_offline+0x130/0x1f0
  device_offline+0xba/0x110
  acpi_bus_offline+0xb7/0x130
  acpi_scan_hot_remove+0x77/0x290
  acpi_device_hotplug+0x1e0/0x240
  acpi_hotplug_work_fn+0x1a/0x30
  process_one_work+0x186/0x340

Besides, do_migrate_range() may be called after memory_failure() sets the
hwpoison flag but before it isolates the folio from the LRU, so remove the
WARN_ON(). In other places, unmap_poisoned_folio() is called only when the
folio is isolated; obey that in do_migrate_range() too.

Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
[ David: don't abort offlining, fixed typo, added comment ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/memory_hotplug.c | 10 ++++++++--
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1f15af712bc34..74318c7877156 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1815,8 +1815,14 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
  			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
  
  		if (folio_contain_hwpoisoned_page(folio)) {
-			if (WARN_ON(folio_test_lru(folio)))
-				folio_isolate_lru(folio);
+			/*
+			 * unmap_poisoned_folio() cannot handle large folios
+			 * in all cases yet.
+			 */
+			if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+				goto put_folio;
+			if (folio_test_lru(folio) && !folio_isolate_lru(folio))
+				goto put_folio;
  			if (folio_mapped(folio)) {
  				folio_lock(folio);
  				unmap_poisoned_folio(folio, pfn, false);
-- 
2.50.1


Man oh man, is hwpoison handling a mess.

-- 
Cheers

David / dhildenb



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-08-21 22:07                                                 ` David Hildenbrand
@ 2025-08-22 17:24                                                   ` Zi Yan
  2025-08-25  2:05                                                   ` Miaohe Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Zi Yan @ 2025-08-22 17:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Pankaj Raghav (Samsung), Matthew Wilcox,
	Luis Chamberlain, Jinjiang Tu, Oscar Salvador, linmiaohe, mhocko,
	linux-mm, wangkefeng.wang

On 21 Aug 2025, at 18:07, David Hildenbrand wrote:

> On 21.08.25 07:02, Andrew Morton wrote:
>> On Tue, 22 Jul 2025 17:30:06 +0200 David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 20.07.25 04:23, Andrew Morton wrote:
>>>>
>>>> I continue to retain the original patch in mm-hotfixes as part of
>>>> akpm's lame bug-tracking system.  3 weeks in -next.
>>>>
>>>> And I just added a cc:stable to it because December 2018.
>>>>
>>>> I don't expect many real-world users will be putting fake delays in
>>>> memory_failure(), but it's there.
>>>>
>>>> So what do we do here?  Add a TODO, merge it under the
>>>> better-than-it-was-before theory and move on?
>>>
>>> I would feel better if we could just not fail memory offlining. Memory
>>> offlining is documented to loop forever if something bad happens, and
>>> user space can cancel it.
>>>
>>
>> Pathetic monthly prod to keep this on people's radar.
>>
>
> Let's do something minimal for now:
>
> From 403b2a375a10c17fd6e2aeffbe0fdaf623faa621 Mon Sep 17 00:00:00 2001
> From: Jinjiang Tu <tujinjiang@huawei.com>
> Date: Fri, 27 Jun 2025 20:57:47 +0800
> Subject: [PATCH] mm/memory_hotplug: fix hwpoisoned large folio handling in
>  do_migrate_range
>
> In do_migrate_range(), the hwpoisoned folio may be large folio, which
> can't be handled by unmap_poisoned_folio().
>
> I can reproduce this issue in qemu after adding delay in memory_failure()
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>  <TASK>
>  rmap_walk_anon+0xda/0x1f0
>  try_to_unmap+0x78/0x80
>  ? __pfx_try_to_unmap_one+0x10/0x10
>  ? __pfx_folio_not_mapped+0x10/0x10
>  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>  unmap_poisoned_folio+0x60/0x140
>  do_migrate_range+0x4d1/0x600
>  ? slab_memory_callback+0x6a/0x190
>  ? notifier_call_chain+0x56/0xb0
>  offline_pages+0x3e6/0x460
>  memory_subsys_offline+0x130/0x1f0
>  device_offline+0xba/0x110
>  acpi_bus_offline+0xb7/0x130
>  acpi_scan_hot_remove+0x77/0x290
>  acpi_device_hotplug+0x1e0/0x240
>  acpi_hotplug_work_fn+0x1a/0x30
>  process_one_work+0x186/0x340
>
> Besides, do_migrate_range() may be called between memory_failure set
> hwpoison flag and isolate the folio from lru, so remove WARN_ON(). In other
> places, unmap_poisoned_folio() is called when the folio is isolated, obey
> it in do_migrate_range() too.
>
> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> [ David: don't abort offlining, fixed typo, added comment ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory_hotplug.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)

Acked-by: Zi Yan <ziy@nvidia.com>

>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1f15af712bc34..74318c7877156 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1815,8 +1815,14 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>   		if (folio_contain_hwpoisoned_page(folio)) {
> -			if (WARN_ON(folio_test_lru(folio)))
> -				folio_isolate_lru(folio);
> +			/*
> +			 * unmap_poisoned_folio() cannot handle large folios
> +			 * in all cases yet.
> +			 */
> +			if (folio_test_large(folio) && !folio_test_hugetlb(folio))
> +				goto put_folio;
> +			if (folio_test_lru(folio) && !folio_isolate_lru(folio))
> +				goto put_folio;
>  			if (folio_mapped(folio)) {
>  				folio_lock(folio);
>  				unmap_poisoned_folio(folio, pfn, false);
> -- 
> 2.50.1
>
>
> Man oh man, is hwpoison handling a mess.
>



--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range
  2025-08-21 22:07                                                 ` David Hildenbrand
  2025-08-22 17:24                                                   ` Zi Yan
@ 2025-08-25  2:05                                                   ` Miaohe Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Miaohe Lin @ 2025-08-25  2:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Pankaj Raghav (Samsung), Matthew Wilcox, Luis Chamberlain,
	Jinjiang Tu, Oscar Salvador, mhocko, linux-mm, wangkefeng.wang,
	Andrew Morton

On 2025/8/22 6:07, David Hildenbrand wrote:
> On 21.08.25 07:02, Andrew Morton wrote:
>> On Tue, 22 Jul 2025 17:30:06 +0200 David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 20.07.25 04:23, Andrew Morton wrote:
>>>>
>>>> I continue to retain the original patch in mm-hotfixes as part of
>>>> akpm's lame bug-tracking system.  3 weeks in -next.
>>>>
>>>> And I just added a cc:stable to it because December 2018.
>>>>
>>>> I don't expect many real-world users will be putting fake delays in
>>>> memory_failure(), but it's there.
>>>>
>>>> So what do we do here?  Add a TODO, merge it under the
>>>> better-than-it-was-before theory and move on?
>>>
>>> I would feel better if we could just not fail memory offlining. Memory
>>> offlining is documented to loop forever if something bad happens, and
>>> user space can cancel it.
>>>
>>
>> Pathetic monthly prod to keep this on people's radar.
>>
> 
> Let's do something minimal for now:
> 
> From 403b2a375a10c17fd6e2aeffbe0fdaf623faa621 Mon Sep 17 00:00:00 2001
> From: Jinjiang Tu <tujinjiang@huawei.com>
> Date: Fri, 27 Jun 2025 20:57:47 +0800
> Subject: [PATCH] mm/memory_hotplug: fix hwpoisoned large folio handling in
>  do_migrate_range
> 
> In do_migrate_range(), the hwpoisoned folio may be a large folio, which
> can't be handled by unmap_poisoned_folio().
> 
> I can reproduce this issue in qemu after adding a delay in memory_failure():
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> RIP: 0010:try_to_unmap_one+0x16a/0xfc0
>  <TASK>
>  rmap_walk_anon+0xda/0x1f0
>  try_to_unmap+0x78/0x80
>  ? __pfx_try_to_unmap_one+0x10/0x10
>  ? __pfx_folio_not_mapped+0x10/0x10
>  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
>  unmap_poisoned_folio+0x60/0x140
>  do_migrate_range+0x4d1/0x600
>  ? slab_memory_callback+0x6a/0x190
>  ? notifier_call_chain+0x56/0xb0
>  offline_pages+0x3e6/0x460
>  memory_subsys_offline+0x130/0x1f0
>  device_offline+0xba/0x110
>  acpi_bus_offline+0xb7/0x130
>  acpi_scan_hot_remove+0x77/0x290
>  acpi_device_hotplug+0x1e0/0x240
>  acpi_hotplug_work_fn+0x1a/0x30
>  process_one_work+0x186/0x340
> 
> Besides, do_migrate_range() may be called between memory_failure() setting
> the hwpoison flag and isolating the folio from the LRU, so remove the
> WARN_ON(). Elsewhere, unmap_poisoned_folio() is only called once the folio
> has been isolated; follow the same convention in do_migrate_range().
> 
> Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> [ David: don't abort offlining, fixed typo, added comment ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory_hotplug.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1f15af712bc34..74318c7877156 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1815,8 +1815,14 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>              pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>  
>          if (folio_contain_hwpoisoned_page(folio)) {
> -            if (WARN_ON(folio_test_lru(folio)))
> -                folio_isolate_lru(folio);
> +            /*
> +             * unmap_poisoned_folio() cannot handle large folios
> +             * in all cases yet.
> +             */
> +            if (folio_test_large(folio) && !folio_test_hugetlb(folio))
> +                goto put_folio;
> +            if (folio_test_lru(folio) && !folio_isolate_lru(folio))
> +                goto put_folio;
>              if (folio_mapped(folio)) {
>                  folio_lock(folio);
>                  unmap_poisoned_folio(folio, pfn, false);

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-08-25  2:06 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-27 12:57 [PATCH v2 0/2] fix two calls of unmap_poisoned_folio() for large folio Jinjiang Tu
2025-06-27 12:57 ` [PATCH v2 1/2] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list Jinjiang Tu
2025-06-27 17:10   ` David Hildenbrand
2025-06-27 22:00   ` Andrew Morton
2025-06-28  2:38     ` Jinjiang Tu
2025-06-28  3:13   ` Miaohe Lin
2025-07-01 14:13   ` Oscar Salvador
2025-07-03  7:30     ` Jinjiang Tu
2025-06-27 12:57 ` [PATCH v2 2/2] mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range Jinjiang Tu
2025-07-01 14:21   ` Oscar Salvador
2025-07-03  7:46     ` Jinjiang Tu
2025-07-03  7:57       ` David Hildenbrand
2025-07-03  8:24         ` Jinjiang Tu
2025-07-03  9:06           ` David Hildenbrand
2025-07-07 11:51             ` Jinjiang Tu
2025-07-07 12:37               ` David Hildenbrand
2025-07-08  1:15                 ` Jinjiang Tu
2025-07-08  9:54                   ` David Hildenbrand
2025-07-09 16:27                     ` Zi Yan
2025-07-14 13:53                       ` Pankaj Raghav
2025-07-14 14:20                         ` Zi Yan
2025-07-14 14:24                           ` David Hildenbrand
2025-07-14 15:09                             ` Pankaj Raghav (Samsung)
2025-07-14 15:14                               ` David Hildenbrand
2025-07-14 15:25                                 ` Zi Yan
2025-07-14 15:28                                   ` Zi Yan
2025-07-14 15:33                                     ` David Hildenbrand
2025-07-14 15:44                                       ` Zi Yan
2025-07-14 15:52                                         ` David Hildenbrand
2025-07-20  2:23                                           ` Andrew Morton
2025-07-22 15:30                                             ` David Hildenbrand
2025-08-21  5:02                                               ` Andrew Morton
2025-08-21 22:07                                                 ` David Hildenbrand
2025-08-22 17:24                                                   ` Zi Yan
2025-08-25  2:05                                                   ` Miaohe Lin
2025-07-03  7:53   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).