* [PATCH v2 0/3] Do not change split folio target order
@ 2025-10-16 3:34 Zi Yan
2025-10-16 3:34 ` [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently Zi Yan
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Zi Yan @ 2025-10-16 3:34 UTC (permalink / raw)
To: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs
Cc: ziy, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
Hi all,
Currently, huge page and large folio split APIs would bump new order
when the target folio has min_order_for_split() > 0 and return
success if split succeeds. When callers expect after-split folios to be
order-0, the actual ones are not. The callers might not be able to
handle them, since they call huge page and large folio split APIs to get
order-0 folios. This issue appears in a recent report on
memory_failure()[1], where memory_failure() used split_huge_page() to split
a large forlio to order-0, but after a successful split got non order-0
folios. Because memory_failure() can only handle order-0 folios, this
caused a WARNING.
Fix the issue by not changing split target order and failing the
split if min_order_for_split() is greater than the target order.
In addition, to avoid wasting memory in memory failure handling, a second
patch is added to always split a large folio to min_order_for_split()
even if it is not 0, so that folios not containing the poisoned page can
be freed for reuse. For soft offline, since the folio is still accessible,
do not split if min_order_for_split() is not zero to avoid potential
performance loss.
Changelog
===
From V1[2]:
1. Fixed !CONFIG_TRANSPARENT_HUGEPAGE version of try_folio_split()
signature.
2. Updated the comment of try_folio_split().
3. Renamed try_folio_split() to try_folio_split_to_order().
4. Removed unused list parameter from try_folio_split_to_order().
5. Added information on min_order_for_split() in
try_folio_split_to_order()'s comment.
6. Added a comment on non_uniform_split_supported() caller on
warns=false.
7. Added min_order_for_split() to !CONFIG_TRANSPARENT_HUGEPAGE.
8. Fixed kernel-doc comment format for try_folio_split_to_order(), folio_split,
__folio_split(), and __split_unmapped_folio().
Link: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/ [1]
Link: https://lore.kernel.org/linux-mm/20251010173906.3128789-2-ziy@nvidia.com/ [2]
Zi Yan (3):
mm/huge_memory: do not change split_huge_page*() target order
silently.
mm/memory-failure: improve large block size folio handling.
mm/huge_memory: fix kernel-doc comments for folio_split() and related.
include/linux/huge_mm.h | 61 ++++++++++++++++++-----------------------
mm/huge_memory.c | 36 +++++++++++-------------
mm/memory-failure.c | 25 ++++++++++++++---
mm/truncate.c | 6 ++--
4 files changed, 68 insertions(+), 60 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 3:34 [PATCH v2 0/3] Do not change split folio target order Zi Yan
@ 2025-10-16 3:34 ` Zi Yan
2025-10-16 7:31 ` Wei Yang
2025-10-16 3:34 ` [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling Zi Yan
2025-10-16 3:34 ` [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related Zi Yan
2 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-16 3:34 UTC (permalink / raw)
To: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs
Cc: ziy, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm, Pankaj Raghav
Page cache folios from a file system that support large block size (LBS)
can have minimal folio order greater than 0, thus a high order folio might
not be able to be split down to order-0. Commit e220917fa507 ("mm: split a
folio in minimum folio order chunks") bumps the target order of
split_huge_page*() to the minimum allowed order when splitting a LBS folio.
This causes confusion for some split_huge_page*() callers like memory
failure handling code, since they expect after-split folios all have
order-0 when split succeeds but in really get min_order_for_split() order
folios.
Fix it by failing a split if the folio cannot be split to the target order.
Rename try_folio_split() to try_folio_split_to_order() to reflect the added
new_order parameter. Remove its unused list parameter.
Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks")
[The test poisons LBS folios, which cannot be split to order-0 folios, and
also tries to poison all memory. The non split LBS folios take more memory
than the test anticipated, leading to OOM. The patch fixed the kernel
warning and the test needs some change to avoid OOM.]
Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
include/linux/huge_mm.h | 55 +++++++++++++++++------------------------
mm/huge_memory.c | 9 +------
mm/truncate.c | 6 +++--
3 files changed, 28 insertions(+), 42 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c4a811958cda..3d9587f40c0b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -383,45 +383,30 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
}
/*
- * try_folio_split - try to split a @folio at @page using non uniform split.
+ * try_folio_split_to_order - try to split a @folio at @page to @new_order using
+ * non uniform split.
* @folio: folio to be split
- * @page: split to order-0 at the given page
- * @list: store the after-split folios
+ * @page: split to @order at the given page
+ * @new_order: the target split order
*
- * Try to split a @folio at @page using non uniform split to order-0, if
- * non uniform split is not supported, fall back to uniform split.
+ * Try to split a @folio at @page using non uniform split to @new_order, if
+ * non uniform split is not supported, fall back to uniform split. After-split
+ * folios are put back to LRU list. Use min_order_for_split() to get the lower
+ * bound of @new_order.
*
* Return: 0: split is successful, otherwise split failed.
*/
-static inline int try_folio_split(struct folio *folio, struct page *page,
- struct list_head *list)
+static inline int try_folio_split_to_order(struct folio *folio,
+ struct page *page, unsigned int new_order)
{
- int ret = min_order_for_split(folio);
-
- if (ret < 0)
- return ret;
-
- if (!non_uniform_split_supported(folio, 0, false))
- return split_huge_page_to_list_to_order(&folio->page, list,
- ret);
- return folio_split(folio, ret, page, list);
+ if (!non_uniform_split_supported(folio, new_order, /* warns= */ false))
+ return split_huge_page_to_list_to_order(&folio->page, NULL,
+ new_order);
+ return folio_split(folio, new_order, page, NULL);
}
static inline int split_huge_page(struct page *page)
{
- struct folio *folio = page_folio(page);
- int ret = min_order_for_split(folio);
-
- if (ret < 0)
- return ret;
-
- /*
- * split_huge_page() locks the page before splitting and
- * expects the same page that has been split to be locked when
- * returned. split_folio(page_folio(page)) cannot be used here
- * because it converts the page to folio and passes the head
- * page to be split.
- */
- return split_huge_page_to_list_to_order(page, NULL, ret);
+ return split_huge_page_to_list_to_order(page, NULL, 0);
}
void deferred_split_folio(struct folio *folio, bool partially_mapped);
#ifdef CONFIG_MEMCG
@@ -611,14 +596,20 @@ static inline int split_huge_page(struct page *page)
return -EINVAL;
}
+static inline int min_order_for_split(struct folio *folio)
+{
+ VM_WARN_ON_ONCE_FOLIO(1, folio);
+ return -EINVAL;
+}
+
static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
{
VM_WARN_ON_ONCE_FOLIO(1, folio);
return -EINVAL;
}
-static inline int try_folio_split(struct folio *folio, struct page *page,
- struct list_head *list)
+static inline int try_folio_split_to_order(struct folio *folio,
+ struct page *page, unsigned int new_order)
{
VM_WARN_ON_ONCE_FOLIO(1, folio);
return -EINVAL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c82a0ac6e69..f308f11dc72f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3805,8 +3805,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
min_order = mapping_min_folio_order(folio->mapping);
if (new_order < min_order) {
- VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
- min_order);
ret = -EINVAL;
goto out;
}
@@ -4158,12 +4156,7 @@ int min_order_for_split(struct folio *folio)
int split_folio_to_list(struct folio *folio, struct list_head *list)
{
- int ret = min_order_for_split(folio);
-
- if (ret < 0)
- return ret;
-
- return split_huge_page_to_list_to_order(&folio->page, list, ret);
+ return split_huge_page_to_list_to_order(&folio->page, list, 0);
}
/*
diff --git a/mm/truncate.c b/mm/truncate.c
index 91eb92a5ce4f..9210cf808f5c 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -194,6 +194,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
size_t size = folio_size(folio);
unsigned int offset, length;
struct page *split_at, *split_at2;
+ unsigned int min_order;
if (pos < start)
offset = start - pos;
@@ -223,8 +224,9 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
if (!folio_test_large(folio))
return true;
+ min_order = mapping_min_folio_order(folio->mapping);
split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
- if (!try_folio_split(folio, split_at, NULL)) {
+ if (!try_folio_split_to_order(folio, split_at, min_order)) {
/*
* try to split at offset + length to make sure folios within
* the range can be dropped, especially to avoid memory waste
@@ -254,7 +256,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
*/
if (folio_test_large(folio2) &&
folio2->mapping == folio->mapping)
- try_folio_split(folio2, split_at2, NULL);
+ try_folio_split_to_order(folio2, split_at2, min_order);
folio_unlock(folio2);
out:
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-16 3:34 [PATCH v2 0/3] Do not change split folio target order Zi Yan
2025-10-16 3:34 ` [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently Zi Yan
@ 2025-10-16 3:34 ` Zi Yan
2025-10-17 9:33 ` Lorenzo Stoakes
2025-10-17 19:11 ` Yang Shi
2025-10-16 3:34 ` [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related Zi Yan
2 siblings, 2 replies; 27+ messages in thread
From: Zi Yan @ 2025-10-16 3:34 UTC (permalink / raw)
To: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs
Cc: ziy, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
Large block size (LBS) folios cannot be split to order-0 folios but
min_order_for_folio(). Current split fails directly, but that is not
optimal. Split the folio to min_order_for_folio(), so that, after split,
only the folio containing the poisoned page becomes unusable instead.
For soft offline, do not split the large folio if it cannot be split to
order-0. Since the folio is still accessible from userspace and premature
split might lead to potential performance loss.
Suggested-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/memory-failure.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f698df156bf8..443df9581c24 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
* there is still more to do, hence the page refcount we took earlier
* is still needed.
*/
-static int try_to_split_thp_page(struct page *page, bool release)
+static int try_to_split_thp_page(struct page *page, unsigned int new_order,
+ bool release)
{
int ret;
lock_page(page);
- ret = split_huge_page(page);
+ ret = split_huge_page_to_list_to_order(page, NULL, new_order);
unlock_page(page);
if (ret && release)
@@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
folio_unlock(folio);
if (folio_test_large(folio)) {
+ int new_order = min_order_for_split(folio);
/*
* The flag must be set after the refcount is bumped
* otherwise it may race with THP split.
@@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
* page is a valid handlable page.
*/
folio_set_has_hwpoisoned(folio);
- if (try_to_split_thp_page(p, false) < 0) {
+ /*
+ * If the folio cannot be split to order-0, kill the process,
+ * but split the folio anyway to minimize the amount of unusable
+ * pages.
+ */
+ if (try_to_split_thp_page(p, new_order, false) || new_order) {
+ /* get folio again in case the original one is split */
+ folio = page_folio(p);
res = -EHWPOISON;
kill_procs_now(p, pfn, flags, folio);
put_page(p);
@@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
};
if (!huge && folio_test_large(folio)) {
- if (try_to_split_thp_page(page, true)) {
+ int new_order = min_order_for_split(folio);
+
+ /*
+ * If the folio cannot be split to order-0, do not split it at
+ * all to retain the still accessible large folio.
+ * NOTE: if getting free memory is perferred, split it like it
+ * is done in memory_failure().
+ */
+ if (new_order || try_to_split_thp_page(page, new_order, true)) {
pr_info("%#lx: thp split failed\n", pfn);
return -EBUSY;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related.
2025-10-16 3:34 [PATCH v2 0/3] Do not change split folio target order Zi Yan
2025-10-16 3:34 ` [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently Zi Yan
2025-10-16 3:34 ` [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling Zi Yan
@ 2025-10-16 3:34 ` Zi Yan
2025-10-17 9:20 ` Lorenzo Stoakes
2 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-16 3:34 UTC (permalink / raw)
To: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs
Cc: ziy, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
try_folio_split_to_order(), folio_split, __folio_split(), and
__split_unmapped_folio() do not have correct kernel-doc comment format.
Fix them.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 10 ++++++----
mm/huge_memory.c | 27 +++++++++++++++------------
2 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3d9587f40c0b..1a1b9ed50acc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -382,9 +382,9 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
return __split_huge_page_to_list_to_order(page, list, new_order, false);
}
-/*
- * try_folio_split_to_order - try to split a @folio at @page to @new_order using
- * non uniform split.
+/**
+ * try_folio_split_to_order() - try to split a @folio at @page to @new_order
+ * using non uniform split.
* @folio: folio to be split
* @page: split to @order at the given page
* @new_order: the target split order
@@ -394,7 +394,7 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
* folios are put back to LRU list. Use min_order_for_split() to get the lower
* bound of @new_order.
*
- * Return: 0: split is successful, otherwise split failed.
+ * Return: 0 - split is successful, otherwise split failed.
*/
static inline int try_folio_split_to_order(struct folio *folio,
struct page *page, unsigned int new_order)
@@ -483,6 +483,8 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
/**
* folio_test_pmd_mappable - Can we map this folio with a PMD?
* @folio: The folio to test
+ *
+ * Return: true - @folio can be mapped, false - @folio cannot be mapped.
*/
static inline bool folio_test_pmd_mappable(struct folio *folio)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f308f11dc72f..89179711539e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3552,8 +3552,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
ClearPageCompound(&folio->page);
}
-/*
- * It splits an unmapped @folio to lower order smaller folios in two ways.
+/**
+ * __split_unmapped_folio() - splits an unmapped @folio to lower order folios in
+ * two ways: uniform split or non-uniform split.
* @folio: the to-be-split folio
* @new_order: the smallest order of the after split folios (since buddy
* allocator like split generates folios with orders from @folio's
@@ -3588,8 +3589,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* folio containing @page. The caller needs to unlock and/or free after-split
* folios if necessary.
*
- * For !uniform_split, when -ENOMEM is returned, the original folio might be
- * split. The caller needs to check the input folio.
+ * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
+ * split but not to @new_order, the caller needs to check)
*/
static int __split_unmapped_folio(struct folio *folio, int new_order,
struct page *split_at, struct xa_state *xas,
@@ -3703,8 +3704,8 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
return true;
}
-/*
- * __folio_split: split a folio at @split_at to a @new_order folio
+/**
+ * __folio_split() - split a folio at @split_at to a @new_order folio
* @folio: folio to split
* @new_order: the order of the new folio
* @split_at: a page within the new folio
@@ -3722,7 +3723,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
* 1. for uniform split, @lock_at points to one of @folio's subpages;
* 2. for buddy allocator like (non-uniform) split, @lock_at points to @folio.
*
- * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
+ * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
* split but not to @new_order, the caller needs to check)
*/
static int __folio_split(struct folio *folio, unsigned int new_order,
@@ -4111,14 +4112,13 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
unmapped);
}
-/*
- * folio_split: split a folio at @split_at to a @new_order folio
+/**
+ * folio_split() - split a folio at @split_at to a @new_order folio
* @folio: folio to split
* @new_order: the order of the new folio
* @split_at: a page within the new folio
- *
- * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
- * split but not to @new_order, the caller needs to check)
+ * @list: after-split folios are added to @list if not null, otherwise to LRU
+ * list
*
* It has the same prerequisites and returns as
* split_huge_page_to_list_to_order().
@@ -4132,6 +4132,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
* [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
*
* After split, folio is left locked for caller.
+ *
+ * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
+ * split but not to @new_order, the caller needs to check)
*/
int folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct list_head *list)
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 3:34 ` [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently Zi Yan
@ 2025-10-16 7:31 ` Wei Yang
2025-10-16 14:32 ` Zi Yan
0 siblings, 1 reply; 27+ messages in thread
From: Wei Yang @ 2025-10-16 7:31 UTC (permalink / raw)
To: Zi Yan
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), Wei Yang,
linux-fsdevel, linux-kernel, linux-mm, Pankaj Raghav
On Wed, Oct 15, 2025 at 11:34:50PM -0400, Zi Yan wrote:
>Page cache folios from a file system that support large block size (LBS)
>can have minimal folio order greater than 0, thus a high order folio might
>not be able to be split down to order-0. Commit e220917fa507 ("mm: split a
>folio in minimum folio order chunks") bumps the target order of
>split_huge_page*() to the minimum allowed order when splitting a LBS folio.
>This causes confusion for some split_huge_page*() callers like memory
>failure handling code, since they expect after-split folios all have
>order-0 when split succeeds but in really get min_order_for_split() order
>folios.
>
>Fix it by failing a split if the folio cannot be split to the target order.
>Rename try_folio_split() to try_folio_split_to_order() to reflect the added
>new_order parameter. Remove its unused list parameter.
>
>Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks")
>[The test poisons LBS folios, which cannot be split to order-0 folios, and
>also tries to poison all memory. The non split LBS folios take more memory
>than the test anticipated, leading to OOM. The patch fixed the kernel
>warning and the test needs some change to avoid OOM.]
>Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com
>Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/
>Signed-off-by: Zi Yan <ziy@nvidia.com>
>Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Do we want to cc stable?
>---
> include/linux/huge_mm.h | 55 +++++++++++++++++------------------------
> mm/huge_memory.c | 9 +------
> mm/truncate.c | 6 +++--
> 3 files changed, 28 insertions(+), 42 deletions(-)
>
>diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>index c4a811958cda..3d9587f40c0b 100644
>--- a/include/linux/huge_mm.h
>+++ b/include/linux/huge_mm.h
>@@ -383,45 +383,30 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
> }
>
> /*
>- * try_folio_split - try to split a @folio at @page using non uniform split.
>+ * try_folio_split_to_order - try to split a @folio at @page to @new_order using
>+ * non uniform split.
> * @folio: folio to be split
>- * @page: split to order-0 at the given page
>- * @list: store the after-split folios
>+ * @page: split to @order at the given page
split to @new_order?
>+ * @new_order: the target split order
> *
>- * Try to split a @folio at @page using non uniform split to order-0, if
>- * non uniform split is not supported, fall back to uniform split.
>+ * Try to split a @folio at @page using non uniform split to @new_order, if
>+ * non uniform split is not supported, fall back to uniform split. After-split
>+ * folios are put back to LRU list. Use min_order_for_split() to get the lower
>+ * bound of @new_order.
We removed min_order_for_split() here right?
> *
> * Return: 0: split is successful, otherwise split failed.
> */
>-static inline int try_folio_split(struct folio *folio, struct page *page,
>- struct list_head *list)
>+static inline int try_folio_split_to_order(struct folio *folio,
>+ struct page *page, unsigned int new_order)
> {
>- int ret = min_order_for_split(folio);
>-
>- if (ret < 0)
>- return ret;
>-
>- if (!non_uniform_split_supported(folio, 0, false))
>- return split_huge_page_to_list_to_order(&folio->page, list,
>- ret);
>- return folio_split(folio, ret, page, list);
>+ if (!non_uniform_split_supported(folio, new_order, /* warns= */ false))
>+ return split_huge_page_to_list_to_order(&folio->page, NULL,
>+ new_order);
>+ return folio_split(folio, new_order, page, NULL);
> }
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 7:31 ` Wei Yang
@ 2025-10-16 14:32 ` Zi Yan
2025-10-16 20:59 ` Andrew Morton
2025-10-17 1:01 ` Wei Yang
0 siblings, 2 replies; 27+ messages in thread
From: Zi Yan @ 2025-10-16 14:32 UTC (permalink / raw)
To: Wei Yang
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), linux-fsdevel,
linux-kernel, linux-mm, Pankaj Raghav
On 16 Oct 2025, at 3:31, Wei Yang wrote:
> On Wed, Oct 15, 2025 at 11:34:50PM -0400, Zi Yan wrote:
>> Page cache folios from a file system that support large block size (LBS)
>> can have minimal folio order greater than 0, thus a high order folio might
>> not be able to be split down to order-0. Commit e220917fa507 ("mm: split a
>> folio in minimum folio order chunks") bumps the target order of
>> split_huge_page*() to the minimum allowed order when splitting a LBS folio.
>> This causes confusion for some split_huge_page*() callers like memory
>> failure handling code, since they expect after-split folios all have
>> order-0 when split succeeds but in really get min_order_for_split() order
>> folios.
>>
>> Fix it by failing a split if the folio cannot be split to the target order.
>> Rename try_folio_split() to try_folio_split_to_order() to reflect the added
>> new_order parameter. Remove its unused list parameter.
>>
>> Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks")
>> [The test poisons LBS folios, which cannot be split to order-0 folios, and
>> also tries to poison all memory. The non split LBS folios take more memory
>> than the test anticipated, leading to OOM. The patch fixed the kernel
>> warning and the test needs some change to avoid OOM.]
>> Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
>
> Do we want to cc stable?
This only triggers a warning, so I am inclined not to.
But some config decides to crash on kernel warnings. If anyone thinks
it is worth ccing stable, please let me know.
>
>> ---
>> include/linux/huge_mm.h | 55 +++++++++++++++++------------------------
>> mm/huge_memory.c | 9 +------
>> mm/truncate.c | 6 +++--
>> 3 files changed, 28 insertions(+), 42 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index c4a811958cda..3d9587f40c0b 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -383,45 +383,30 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
>> }
>>
>> /*
>> - * try_folio_split - try to split a @folio at @page using non uniform split.
>> + * try_folio_split_to_order - try to split a @folio at @page to @new_order using
>> + * non uniform split.
>> * @folio: folio to be split
>> - * @page: split to order-0 at the given page
>> - * @list: store the after-split folios
>> + * @page: split to @order at the given page
>
> split to @new_order?
Will fix it.
>
>> + * @new_order: the target split order
>> *
>> - * Try to split a @folio at @page using non uniform split to order-0, if
>> - * non uniform split is not supported, fall back to uniform split.
>> + * Try to split a @folio at @page using non uniform split to @new_order, if
>> + * non uniform split is not supported, fall back to uniform split. After-split
>> + * folios are put back to LRU list. Use min_order_for_split() to get the lower
>> + * bound of @new_order.
>
> We removed min_order_for_split() here right?
We removed it from the code, but caller should use min_order_for_split()
to get the lower bound of new_order if they do not want to split to fail
unexpectedly.
Thank you for the review.
>
>> *
>> * Return: 0: split is successful, otherwise split failed.
>> */
>> -static inline int try_folio_split(struct folio *folio, struct page *page,
>> - struct list_head *list)
>> +static inline int try_folio_split_to_order(struct folio *folio,
>> + struct page *page, unsigned int new_order)
>> {
>> - int ret = min_order_for_split(folio);
>> -
>> - if (ret < 0)
>> - return ret;
>> -
>> - if (!non_uniform_split_supported(folio, 0, false))
>> - return split_huge_page_to_list_to_order(&folio->page, list,
>> - ret);
>> - return folio_split(folio, ret, page, list);
>> + if (!non_uniform_split_supported(folio, new_order, /* warns= */ false))
>> + return split_huge_page_to_list_to_order(&folio->page, NULL,
>> + new_order);
>> + return folio_split(folio, new_order, page, NULL);
>> }
>
> --
> Wei Yang
> Help you, Help me
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 14:32 ` Zi Yan
@ 2025-10-16 20:59 ` Andrew Morton
2025-10-17 1:03 ` Zi Yan
2025-10-17 1:01 ` Wei Yang
1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2025-10-16 20:59 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
> > Do we want to cc stable?
>
> This only triggers a warning, so I am inclined not to.
> But some config decides to crash on kernel warnings. If anyone thinks
> it is worth ccing stable, please let me know.
Yes please. Kernel warnings are pretty serious and I do like to fix
them in -stable when possible.
That means this patch will have a different routing and priority than
the other two so please split the warning fix out from the series.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 14:32 ` Zi Yan
2025-10-16 20:59 ` Andrew Morton
@ 2025-10-17 1:01 ` Wei Yang
1 sibling, 0 replies; 27+ messages in thread
From: Wei Yang @ 2025-10-17 1:01 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Thu, Oct 16, 2025 at 10:32:17AM -0400, Zi Yan wrote:
>On 16 Oct 2025, at 3:31, Wei Yang wrote:
>
[...]
>
>>
>>> + * @new_order: the target split order
>>> *
>>> - * Try to split a @folio at @page using non uniform split to order-0, if
>>> - * non uniform split is not supported, fall back to uniform split.
>>> + * Try to split a @folio at @page using non uniform split to @new_order, if
>>> + * non uniform split is not supported, fall back to uniform split. After-split
>>> + * folios are put back to LRU list. Use min_order_for_split() to get the lower
>>> + * bound of @new_order.
>>
>> We removed min_order_for_split() here right?
>
>We removed it from the code, but caller should use min_order_for_split()
>to get the lower bound of new_order if they do not want to split to fail
>unexpectedly.
>
>Thank you for the review.
Thanks, my poor English, I got what you mean.
No other comments.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-16 20:59 ` Andrew Morton
@ 2025-10-17 1:03 ` Zi Yan
2025-10-17 9:06 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-17 1:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On 16 Oct 2025, at 16:59, Andrew Morton wrote:
> On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
>
>>> Do we want to cc stable?
>>
>> This only triggers a warning, so I am inclined not to.
>> But some config decides to crash on kernel warnings. If anyone thinks
>> it is worth ccing stable, please let me know.
>
> Yes please. Kernel warnings are pretty serious and I do like to fix
> them in -stable when possible.
>
> That means this patch will have a different routing and priority than
> the other two so please split the warning fix out from the series.
OK. Let me send this one and cc stable.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-17 1:03 ` Zi Yan
@ 2025-10-17 9:06 ` Lorenzo Stoakes
2025-10-17 9:10 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-10-17 9:06 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Thu, Oct 16, 2025 at 09:03:27PM -0400, Zi Yan wrote:
> On 16 Oct 2025, at 16:59, Andrew Morton wrote:
>
> > On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
> >
> >>> Do we want to cc stable?
> >>
> >> This only triggers a warning, so I am inclined not to.
> >> But some config decides to crash on kernel warnings. If anyone thinks
> >> it is worth ccing stable, please let me know.
> >
> > Yes please. Kernel warnings are pretty serious and I do like to fix
> > them in -stable when possible.
> >
> > That means this patch will have a different routing and priority than
> > the other two so please split the warning fix out from the series.
>
> OK. Let me send this one and cc stable.
You've added a bunch of confusion here, now if I review the rest of this series
it looks like I'm reviewing it with this stale patch included.
Can you please resend the remainder of the series as a v3 so it's clear? Thanks!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-17 9:06 ` Lorenzo Stoakes
@ 2025-10-17 9:10 ` Lorenzo Stoakes
2025-10-17 14:16 ` Zi Yan
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-10-17 9:10 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Fri, Oct 17, 2025 at 10:06:41AM +0100, Lorenzo Stoakes wrote:
> On Thu, Oct 16, 2025 at 09:03:27PM -0400, Zi Yan wrote:
> > On 16 Oct 2025, at 16:59, Andrew Morton wrote:
> >
> > > On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
> > >
> > >>> Do we want to cc stable?
> > >>
> > >> This only triggers a warning, so I am inclined not to.
> > >> But some config decides to crash on kernel warnings. If anyone thinks
> > >> it is worth ccing stable, please let me know.
> > >
> > > Yes please. Kernel warnings are pretty serious and I do like to fix
> > > them in -stable when possible.
> > >
> > > That means this patch will have a different routing and priority than
> > > the other two so please split the warning fix out from the series.
> >
> > OK. Let me send this one and cc stable.
>
> You've added a bunch of confusion here, now if I review the rest of this series
> it looks like I'm reviewing it with this stale patch included.
>
> Can you please resend the remainder of the series as a v3 so it's clear? Thanks!
Oh and now this entire series relies on that one landing to work :/
What a mess - Can't we just live with one patch from a series being stable and
the rest not? Seems crazy otherwise.
I guess when you resend you'll need to put explicitly in the cover letter
'relies on patch xxxx'
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related.
2025-10-16 3:34 ` [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related Zi Yan
@ 2025-10-17 9:20 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-10-17 9:20 UTC (permalink / raw)
To: Zi Yan
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
On Wed, Oct 15, 2025 at 11:34:52PM -0400, Zi Yan wrote:
> try_folio_split_to_order(), folio_split, __folio_split(), and
> __split_unmapped_folio() do not have correct kernel-doc comment format.
> Fix them.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
Thanks for doing this! LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/huge_mm.h | 10 ++++++----
> mm/huge_memory.c | 27 +++++++++++++++------------
> 2 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3d9587f40c0b..1a1b9ed50acc 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -382,9 +382,9 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
> return __split_huge_page_to_list_to_order(page, list, new_order, false);
> }
>
> -/*
> - * try_folio_split_to_order - try to split a @folio at @page to @new_order using
> - * non uniform split.
> +/**
> + * try_folio_split_to_order() - try to split a @folio at @page to @new_order
> + * using non uniform split.
> * @folio: folio to be split
> * @page: split to @order at the given page
> * @new_order: the target split order
> @@ -394,7 +394,7 @@ static inline int split_huge_page_to_list_to_order(struct page *page, struct lis
> * folios are put back to LRU list. Use min_order_for_split() to get the lower
> * bound of @new_order.
> *
> - * Return: 0: split is successful, otherwise split failed.
> + * Return: 0 - split is successful, otherwise split failed.
> */
> static inline int try_folio_split_to_order(struct folio *folio,
> struct page *page, unsigned int new_order)
> @@ -483,6 +483,8 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
> /**
> * folio_test_pmd_mappable - Can we map this folio with a PMD?
> * @folio: The folio to test
> + *
> + * Return: true - @folio can be mapped, false - @folio cannot be mapped.
> */
> static inline bool folio_test_pmd_mappable(struct folio *folio)
> {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f308f11dc72f..89179711539e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3552,8 +3552,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> ClearPageCompound(&folio->page);
> }
>
> -/*
> - * It splits an unmapped @folio to lower order smaller folios in two ways.
> +/**
> + * __split_unmapped_folio() - splits an unmapped @folio to lower order folios in
> + * two ways: uniform split or non-uniform split.
> * @folio: the to-be-split folio
> * @new_order: the smallest order of the after split folios (since buddy
> * allocator like split generates folios with orders from @folio's
> @@ -3588,8 +3589,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> * folio containing @page. The caller needs to unlock and/or free after-split
> * folios if necessary.
> *
> - * For !uniform_split, when -ENOMEM is returned, the original folio might be
> - * split. The caller needs to check the input folio.
> + * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> + * split but not to @new_order, the caller needs to check)
> */
> static int __split_unmapped_folio(struct folio *folio, int new_order,
> struct page *split_at, struct xa_state *xas,
> @@ -3703,8 +3704,8 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> return true;
> }
>
> -/*
> - * __folio_split: split a folio at @split_at to a @new_order folio
> +/**
> + * __folio_split() - split a folio at @split_at to a @new_order folio
> * @folio: folio to split
> * @new_order: the order of the new folio
> * @split_at: a page within the new folio
> @@ -3722,7 +3723,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> * 1. for uniform split, @lock_at points to one of @folio's subpages;
> * 2. for buddy allocator like (non-uniform) split, @lock_at points to @folio.
> *
> - * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
> + * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> * split but not to @new_order, the caller needs to check)
> */
> static int __folio_split(struct folio *folio, unsigned int new_order,
> @@ -4111,14 +4112,13 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> unmapped);
> }
>
> -/*
> - * folio_split: split a folio at @split_at to a @new_order folio
> +/**
> + * folio_split() - split a folio at @split_at to a @new_order folio
> * @folio: folio to split
> * @new_order: the order of the new folio
> * @split_at: a page within the new folio
> - *
> - * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
> - * split but not to @new_order, the caller needs to check)
> + * @list: after-split folios are added to @list if not null, otherwise to LRU
> + * list
> *
> * It has the same prerequisites and returns as
> * split_huge_page_to_list_to_order().
> @@ -4132,6 +4132,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> * [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
> *
> * After split, folio is left locked for caller.
> + *
> + * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> + * split but not to @new_order, the caller needs to check)
> */
> int folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct list_head *list)
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-16 3:34 ` [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling Zi Yan
@ 2025-10-17 9:33 ` Lorenzo Stoakes
2025-10-20 20:09 ` Zi Yan
2025-10-17 19:11 ` Yang Shi
1 sibling, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-10-17 9:33 UTC (permalink / raw)
To: Zi Yan
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
On Wed, Oct 15, 2025 at 11:34:51PM -0400, Zi Yan wrote:
> Large block size (LBS) folios cannot be split to order-0 folios but
> min_order_for_folio(). Current split fails directly, but that is not
> optimal. Split the folio to min_order_for_folio(), so that, after split,
> only the folio containing the poisoned page becomes unusable instead.
>
> For soft offline, do not split the large folio if it cannot be split to
> order-0. Since the folio is still accessible from userspace and premature
> split might lead to potential performance loss.
>
> Suggested-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> mm/memory-failure.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f698df156bf8..443df9581c24 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> * there is still more to do, hence the page refcount we took earlier
> * is still needed.
> */
> -static int try_to_split_thp_page(struct page *page, bool release)
> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> + bool release)
> {
> int ret;
>
> lock_page(page);
> - ret = split_huge_page(page);
> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
I wonder if we need a wrapper for these list==NULL cases, as
split_huge_page_to_list_to_order suggests you always have a list provided... and
this is ugly :)
split_huge_page_to_order() seems good.
> unlock_page(page);
>
> if (ret && release)
> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> folio_unlock(folio);
>
> if (folio_test_large(folio)) {
> + int new_order = min_order_for_split(folio);
Newline after decl?
> /*
> * The flag must be set after the refcount is bumped
> * otherwise it may race with THP split.
> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> * page is a valid handlable page.
> */
> folio_set_has_hwpoisoned(folio);
> - if (try_to_split_thp_page(p, false) < 0) {
> + /*
> + * If the folio cannot be split to order-0, kill the process,
> + * but split the folio anyway to minimize the amount of unusable
> + * pages.
> + */
> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
Please use /* release= */false here
I'm also not sure about the logic here, it feels unclear.
Something like:
err = try_to_to_split_thp_page(p, new_order, /* release= */false);
/*
* If the folio cannot be split, kill the process.
* If it can be split, but not to order-0, then this defeats the
* expectation that we do so, but we want the split to have been
* made to
*/
if (err || new_order > 0) {
}
> + /* get folio again in case the original one is split */
> + folio = page_folio(p);
> res = -EHWPOISON;
> kill_procs_now(p, pfn, flags, folio);
> put_page(p);
> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
> };
>
> if (!huge && folio_test_large(folio)) {
> - if (try_to_split_thp_page(page, true)) {
> + int new_order = min_order_for_split(folio);
> +
> + /*
> + * If the folio cannot be split to order-0, do not split it at
> + * all to retain the still accessible large folio.
> + * NOTE: if getting free memory is perferred, split it like it
Typo perferred -> preferred.
> + * is done in memory_failure().
I'm confused as to your comment here though, we're not splitting it like
memory_failure()? We're splitting a. with release and b. only if we can target
order-0.
So how would this preference in any way be a thing that happens? :) I may be
missing something here.
> + */
> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
Same comment as above with /* release= */true.
You should pass 0 not new_order to try_to_split_thp_page() here as it has to be
0 for the function to be invoked and that's just obviously clearer.
> pr_info("%#lx: thp split failed\n", pfn);
> return -EBUSY;
> }
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-17 9:10 ` Lorenzo Stoakes
@ 2025-10-17 14:16 ` Zi Yan
2025-10-17 14:32 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-17 14:16 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On 17 Oct 2025, at 5:10, Lorenzo Stoakes wrote:
> On Fri, Oct 17, 2025 at 10:06:41AM +0100, Lorenzo Stoakes wrote:
>> On Thu, Oct 16, 2025 at 09:03:27PM -0400, Zi Yan wrote:
>>> On 16 Oct 2025, at 16:59, Andrew Morton wrote:
>>>
>>>> On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>>>> Do we want to cc stable?
>>>>>
>>>>> This only triggers a warning, so I am inclined not to.
>>>>> But some config decides to crash on kernel warnings. If anyone thinks
>>>>> it is worth ccing stable, please let me know.
>>>>
>>>> Yes please. Kernel warnings are pretty serious and I do like to fix
>>>> them in -stable when possible.
>>>>
>>>> That means this patch will have a different routing and priority than
>>>> the other two so please split the warning fix out from the series.
>>>
>>> OK. Let me send this one and cc stable.
>>
>> You've added a bunch of confusion here, now if I review the rest of this series
What confusion I have added here? Do you mind elaborating?
>> it looks like I'm reviewing it with this stale patch included.
>>
>> Can you please resend the remainder of the series as a v3 so it's clear? Thanks!
>
> Oh and now this entire series relies on that one landing to work :/
>
> What a mess - Can't we just live with one patch from a series being stable and
> the rest not? Seems crazy otherwise.
This is what Andrew told me. Please settle this with Andrew if you do not like
it. I will hold on sending new version of this patchset until either you or
Andrew give me a clear guidance on how to send this patchset.
>
> I guess when you resend you'll need to put explicitly in the cover letter
> 'relies on patch xxxx'
Why? I will simply wait until this patch is merged, then I can send the rest
of two. Separate patchsets with dependency is hard for review, why would I
send them at the same time?
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-17 14:16 ` Zi Yan
@ 2025-10-17 14:32 ` Lorenzo Stoakes
2025-10-18 0:05 ` Andrew Morton
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-10-17 14:32 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Fri, Oct 17, 2025 at 10:16:10AM -0400, Zi Yan wrote:
> On 17 Oct 2025, at 5:10, Lorenzo Stoakes wrote:
>
> > On Fri, Oct 17, 2025 at 10:06:41AM +0100, Lorenzo Stoakes wrote:
> >> On Thu, Oct 16, 2025 at 09:03:27PM -0400, Zi Yan wrote:
> >>> On 16 Oct 2025, at 16:59, Andrew Morton wrote:
> >>>
> >>>> On Thu, 16 Oct 2025 10:32:17 -0400 Zi Yan <ziy@nvidia.com> wrote:
> >>>>
> >>>>>> Do we want to cc stable?
> >>>>>
> >>>>> This only triggers a warning, so I am inclined not to.
> >>>>> But some config decides to crash on kernel warnings. If anyone thinks
> >>>>> it is worth ccing stable, please let me know.
> >>>>
> >>>> Yes please. Kernel warnings are pretty serious and I do like to fix
> >>>> them in -stable when possible.
> >>>>
> >>>> That means this patch will have a different routing and priority than
> >>>> the other two so please split the warning fix out from the series.
> >>>
> >>> OK. Let me send this one and cc stable.
> >>
> >> You've added a bunch of confusion here, now if I review the rest of this series
>
> What confusion I have added here? Do you mind elaborating?
There's 2 series in the tree now:
v2 -> with a stale patch 1/3 + 2/3, 3/3
v3 -> 1/3 separate
If I use any tooling (b4 shazam etc.) to pull this series to review, it'll pull
the state patch.
if 2/3 or 3/3 depend on 1/3 then it's super confused.
All I'm asking is for you to resend/respin the 2 patches without the stale one.
>
> >> it looks like I'm reviewing it with this stale patch included.
> >>
> >> Can you please resend the remainder of the series as a v3 so it's clear? Thanks!
> >
> > Oh and now this entire series relies on that one landing to work :/
> >
> > What a mess - Can't we just live with one patch from a series being stable and
> > the rest not? Seems crazy otherwise.
>
> This is what Andrew told me. Please settle this with Andrew if you do not like
Didn't he just ask you to send 1/3 separately? I don't think he said send 1/3
separately and do not resend 2/3, 3/3...
> it. I will hold on sending new version of this patchset until either you or
> Andrew give me a clear guidance on how to send this patchset.
I mean if you want to delay resending this until the hotfix is sorted out then
just reply to 0/3 saying 'please drop this until that patch is merged'.
Otherwise it looks live.
>
> >
> > I guess when you resend you'll need to put explicitly in the cover letter
> > 'relies on patch xxxx'
>
> Why? I will simply wait until this patch is merged, then I can send the rest
> of two. Separate patchsets with dependency is hard for review, why would I
> send them at the same time?
So you're planning to only resend once the hotfix is upstreamed completely?
Sometimes this can be delayed a couple weeks. But fine.
As long as there's clarity.
>
> --
> Best Regards,
> Yan, Zi
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-16 3:34 ` [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling Zi Yan
2025-10-17 9:33 ` Lorenzo Stoakes
@ 2025-10-17 19:11 ` Yang Shi
2025-10-20 19:46 ` Zi Yan
1 sibling, 1 reply; 27+ messages in thread
From: Yang Shi @ 2025-10-17 19:11 UTC (permalink / raw)
To: Zi Yan
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), Wei Yang,
linux-fsdevel, linux-kernel, linux-mm
On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>
> Large block size (LBS) folios cannot be split to order-0 folios but
> min_order_for_folio(). Current split fails directly, but that is not
> optimal. Split the folio to min_order_for_folio(), so that, after split,
> only the folio containing the poisoned page becomes unusable instead.
>
> For soft offline, do not split the large folio if it cannot be split to
> order-0. Since the folio is still accessible from userspace and premature
> split might lead to potential performance loss.
>
> Suggested-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> mm/memory-failure.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f698df156bf8..443df9581c24 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> * there is still more to do, hence the page refcount we took earlier
> * is still needed.
> */
> -static int try_to_split_thp_page(struct page *page, bool release)
> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> + bool release)
> {
> int ret;
>
> lock_page(page);
> - ret = split_huge_page(page);
> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
> unlock_page(page);
>
> if (ret && release)
> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> folio_unlock(folio);
>
> if (folio_test_large(folio)) {
> + int new_order = min_order_for_split(folio);
> /*
> * The flag must be set after the refcount is bumped
> * otherwise it may race with THP split.
> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> * page is a valid handlable page.
> */
> folio_set_has_hwpoisoned(folio);
> - if (try_to_split_thp_page(p, false) < 0) {
> + /*
> + * If the folio cannot be split to order-0, kill the process,
> + * but split the folio anyway to minimize the amount of unusable
> + * pages.
> + */
> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
to order-0 folios because the PG_hwpoisoned flag is set on the
poisoned page. But if you split the folio to some smaller order large
folios, it seems you need to keep PG_has_hwpoisoned flag on the
poisoned folio.
Yang
> + /* get folio again in case the original one is split */
> + folio = page_folio(p);
> res = -EHWPOISON;
> kill_procs_now(p, pfn, flags, folio);
> put_page(p);
> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
> };
>
> if (!huge && folio_test_large(folio)) {
> - if (try_to_split_thp_page(page, true)) {
> + int new_order = min_order_for_split(folio);
> +
> + /*
> + * If the folio cannot be split to order-0, do not split it at
> + * all to retain the still accessible large folio.
> + * NOTE: if getting free memory is perferred, split it like it
> + * is done in memory_failure().
> + */
> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
> pr_info("%#lx: thp split failed\n", pfn);
> return -EBUSY;
> }
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently.
2025-10-17 14:32 ` Lorenzo Stoakes
@ 2025-10-18 0:05 ` Andrew Morton
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2025-10-18 0:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Zi Yan, Wei Yang, linmiaohe, david, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, mcgrof,
nao.horiguchi, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), linux-fsdevel, linux-kernel, linux-mm,
Pankaj Raghav
On Fri, 17 Oct 2025 15:32:13 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > it. I will hold on sending new version of this patchset until either you or
> > Andrew give me a clear guidance on how to send this patchset.
>
> I mean if you want to delay resending this until the hotfix is sorted out then
> just reply to 0/3 saying 'please drop this until that patch is merged'.
>
> Otherwise it looks live.
Yeah, hotfixes come first and separately please. A hotfix will hit
mainline in a week or so. Whether or not they are cc:stable. The
not-hotfix material won't hit mainline for as long as two months!
So mixing hotfixes with next-merge-window patches is to be avoided.
Note that a "hotfix" may or may not be cc:stable - it depends on
whether the Fixes: commit was present in earlier kernel releases.
Actually, if a developer has a hotfix as well as a bunch of
next-merge-window material then it's really best to send the hotfix
only. Hold off on the next-merge-window material so the hotfix gets
standalone testing. Because it's possible that the next-merge-window
material accidentally fixes an issue in the hotfix.
(otoh the hotfixes *will* get that standalone testing from people who
test Linus-latest, but it's bad of us to depend on that!)
I regularly get patchsets which mix hotfixes (sometimes cc:stable) with
next-merge-window material. Pretty often the hotfix isn't very urgent
so I'll say screwit and merge it all as-is, after adding a cc:stable.
The hotfix will get merged and backported eventually.
I hope that nobody really needs to worry much about all this stuff.
Juggling patch priority and timing is what akpms are for.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-17 19:11 ` Yang Shi
@ 2025-10-20 19:46 ` Zi Yan
2025-10-20 23:41 ` Yang Shi
2025-10-22 6:39 ` Miaohe Lin
0 siblings, 2 replies; 27+ messages in thread
From: Zi Yan @ 2025-10-20 19:46 UTC (permalink / raw)
To: Yang Shi, linmiaohe, jane.chu
Cc: david, kernel, syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm,
mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
On 17 Oct 2025, at 15:11, Yang Shi wrote:
> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Large block size (LBS) folios cannot be split to order-0 folios but
>> min_order_for_folio(). Current split fails directly, but that is not
>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>> only the folio containing the poisoned page becomes unusable instead.
>>
>> For soft offline, do not split the large folio if it cannot be split to
>> order-0. Since the folio is still accessible from userspace and premature
>> split might lead to potential performance loss.
>>
>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> ---
>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index f698df156bf8..443df9581c24 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>> * there is still more to do, hence the page refcount we took earlier
>> * is still needed.
>> */
>> -static int try_to_split_thp_page(struct page *page, bool release)
>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>> + bool release)
>> {
>> int ret;
>>
>> lock_page(page);
>> - ret = split_huge_page(page);
>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>> unlock_page(page);
>>
>> if (ret && release)
>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>> folio_unlock(folio);
>>
>> if (folio_test_large(folio)) {
>> + int new_order = min_order_for_split(folio);
>> /*
>> * The flag must be set after the refcount is bumped
>> * otherwise it may race with THP split.
>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>> * page is a valid handlable page.
>> */
>> folio_set_has_hwpoisoned(folio);
>> - if (try_to_split_thp_page(p, false) < 0) {
>> + /*
>> + * If the folio cannot be split to order-0, kill the process,
>> + * but split the folio anyway to minimize the amount of unusable
>> + * pages.
>> + */
>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>
> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
> to order-0 folios because the PG_hwpoisoned flag is set on the
> poisoned page. But if you split the folio to some smaller order large
> folios, it seems you need to keep PG_has_hwpoisoned flag on the
> poisoned folio.
OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
checked to be able to set after-split folio's flag properly. Current folio
split code does not do that. I am thinking about whether that causes any
issue. Probably not, because:
1. before Patch 1 is applied, large after-split folios are already causing
a warning in memory_failure(). That kinda masks this issue.
2. after Patch 1 is applied, no large after-split folios will appear,
since the split will fail.
@Miaohe and @Jane, please let me know if my above reasoning makes sense or not.
To make this patch right, folio's has_hwpoisoned flag needs to be preserved
like what Yang described above. My current plan is to move
folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
scan every page in the folio if the folio's has_hwpoisoned is set.
There will be redundant scans in non uniform split case, since a has_hwpoisoned
folio can be split multiple times (leading to multiple page scans), unless
the scan result is stored.
@Miaohe and @Jane, is it possible to have multiple HW poisoned pages in
a folio? Is the memory failure process like 1) page access causing MCE,
2) memory_failure() is used to handle it and split the large folio containing
it? Or multiple MCEs can be received and multiple pages in a folio are marked
then a split would happen?
>
> Yang
>
>
>> + /* get folio again in case the original one is split */
>> + folio = page_folio(p);
>> res = -EHWPOISON;
>> kill_procs_now(p, pfn, flags, folio);
>> put_page(p);
>> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
>> };
>>
>> if (!huge && folio_test_large(folio)) {
>> - if (try_to_split_thp_page(page, true)) {
>> + int new_order = min_order_for_split(folio);
>> +
>> + /*
>> + * If the folio cannot be split to order-0, do not split it at
>> + * all to retain the still accessible large folio.
>> + * NOTE: if getting free memory is perferred, split it like it
>> + * is done in memory_failure().
>> + */
>> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
>> pr_info("%#lx: thp split failed\n", pfn);
>> return -EBUSY;
>> }
>> --
>> 2.51.0
>>
>>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-17 9:33 ` Lorenzo Stoakes
@ 2025-10-20 20:09 ` Zi Yan
0 siblings, 0 replies; 27+ messages in thread
From: Zi Yan @ 2025-10-20 20:09 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linmiaohe, david, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm
On 17 Oct 2025, at 5:33, Lorenzo Stoakes wrote:
> On Wed, Oct 15, 2025 at 11:34:51PM -0400, Zi Yan wrote:
>> Large block size (LBS) folios cannot be split to order-0 folios but
>> min_order_for_folio(). Current split fails directly, but that is not
>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>> only the folio containing the poisoned page becomes unusable instead.
>>
>> For soft offline, do not split the large folio if it cannot be split to
>> order-0. Since the folio is still accessible from userspace and premature
>> split might lead to potential performance loss.
>>
>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> ---
>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index f698df156bf8..443df9581c24 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>> * there is still more to do, hence the page refcount we took earlier
>> * is still needed.
>> */
>> -static int try_to_split_thp_page(struct page *page, bool release)
>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>> + bool release)
>> {
>> int ret;
>>
>> lock_page(page);
>> - ret = split_huge_page(page);
>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>
> I wonder if we need a wrapper for these list==NULL cases, as
> split_huge_page_to_list_to_order suggests you always have a list provided... and
> this is ugly :)
>
> split_huge_page_to_order() seems good.
Yes, this suggestion motivated me to remove unused list==NULL parameter in
try_folio_split_to_order(). Thanks.
>
>> unlock_page(page);
>>
>> if (ret && release)
>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>> folio_unlock(folio);
>>
>> if (folio_test_large(folio)) {
>> + int new_order = min_order_for_split(folio);
>
> Newline after decl?
Sure.
>
>> /*
>> * The flag must be set after the refcount is bumped
>> * otherwise it may race with THP split.
>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>> * page is a valid handlable page.
>> */
>> folio_set_has_hwpoisoned(folio);
>> - if (try_to_split_thp_page(p, false) < 0) {
>> + /*
>> + * If the folio cannot be split to order-0, kill the process,
>> + * but split the folio anyway to minimize the amount of unusable
>> + * pages.
>> + */
>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>
> Please use /* release= */false here
OK.
>
>
> I'm also not sure about the logic here, it feels unclear.
>
> Something like:
>
> err = try_to_to_split_thp_page(p, new_order, /* release= */false);
>
> /*
> * If the folio cannot be split, kill the process.
> * If it can be split, but not to order-0, then this defeats the
> * expectation that we do so, but we want the split to have been
> * made to
> */
>
> if (err || new_order > 0) {
> }
Will make the change.
>
>
>> + /* get folio again in case the original one is split */
>> + folio = page_folio(p);
>> res = -EHWPOISON;
>> kill_procs_now(p, pfn, flags, folio);
>> put_page(p);
>> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
>> };
>>
>> if (!huge && folio_test_large(folio)) {
>> - if (try_to_split_thp_page(page, true)) {
>> + int new_order = min_order_for_split(folio);
>> +
>> + /*
>> + * If the folio cannot be split to order-0, do not split it at
>> + * all to retain the still accessible large folio.
>> + * NOTE: if getting free memory is perferred, split it like it
>
> Typo perferred -> preferred.
Got it.
>
>
>> + * is done in memory_failure().
>
> I'm confused as to your comment here though, we're not splitting it like
> memory_failure()? We're splitting a. with release and b. only if we can target
> order-0.
>
> So how would this preference in any way be a thing that happens? :) I may be
> missing something here.
For non LBS folios, min_order_for_split() returns 0. In that case, the split
would happen.
>
>> + */
>> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
>
> Same comment as above with /* release= */true.
Sure.
>
> You should pass 0 not new_order to try_to_split_thp_page() here as it has to be
> 0 for the function to be invoked and that's just obviously clearer.
OK. How about try_to_split_thp_page(page, /* new_order= */ 0, /* release= */ true)?
So that readers can tell 0 is the value of new_order.
>
>
>> pr_info("%#lx: thp split failed\n", pfn);
>> return -EBUSY;
>> }
Thank you for the feedback.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-20 19:46 ` Zi Yan
@ 2025-10-20 23:41 ` Yang Shi
2025-10-21 1:23 ` Zi Yan
2025-10-22 6:39 ` Miaohe Lin
1 sibling, 1 reply; 27+ messages in thread
From: Yang Shi @ 2025-10-20 23:41 UTC (permalink / raw)
To: Zi Yan
Cc: linmiaohe, jane.chu, david, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), Wei Yang,
linux-fsdevel, linux-kernel, linux-mm
On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>
> > On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> Large block size (LBS) folios cannot be split to order-0 folios but
> >> min_order_for_folio(). Current split fails directly, but that is not
> >> optimal. Split the folio to min_order_for_folio(), so that, after split,
> >> only the folio containing the poisoned page becomes unusable instead.
> >>
> >> For soft offline, do not split the large folio if it cannot be split to
> >> order-0. Since the folio is still accessible from userspace and premature
> >> split might lead to potential performance loss.
> >>
> >> Suggested-by: Jane Chu <jane.chu@oracle.com>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> >> ---
> >> mm/memory-failure.c | 25 +++++++++++++++++++++----
> >> 1 file changed, 21 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index f698df156bf8..443df9581c24 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> >> * there is still more to do, hence the page refcount we took earlier
> >> * is still needed.
> >> */
> >> -static int try_to_split_thp_page(struct page *page, bool release)
> >> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> >> + bool release)
> >> {
> >> int ret;
> >>
> >> lock_page(page);
> >> - ret = split_huge_page(page);
> >> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
> >> unlock_page(page);
> >>
> >> if (ret && release)
> >> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> >> folio_unlock(folio);
> >>
> >> if (folio_test_large(folio)) {
> >> + int new_order = min_order_for_split(folio);
> >> /*
> >> * The flag must be set after the refcount is bumped
> >> * otherwise it may race with THP split.
> >> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> >> * page is a valid handlable page.
> >> */
> >> folio_set_has_hwpoisoned(folio);
> >> - if (try_to_split_thp_page(p, false) < 0) {
> >> + /*
> >> + * If the folio cannot be split to order-0, kill the process,
> >> + * but split the folio anyway to minimize the amount of unusable
> >> + * pages.
> >> + */
> >> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
> >
> > folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
> > to order-0 folios because the PG_hwpoisoned flag is set on the
> > poisoned page. But if you split the folio to some smaller order large
> > folios, it seems you need to keep PG_has_hwpoisoned flag on the
> > poisoned folio.
>
> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
> checked to be able to set after-split folio's flag properly. Current folio
> split code does not do that. I am thinking about whether that causes any
> issue. Probably not, because:
>
> 1. before Patch 1 is applied, large after-split folios are already causing
> a warning in memory_failure(). That kinda masks this issue.
> 2. after Patch 1 is applied, no large after-split folios will appear,
> since the split will fail.
I'm a little bit confused. Didn't this patch split large folio to
new-order-large-folio (new order is min order)? So this patch had
code:
if (try_to_split_thp_page(p, new_order, false) || new_order) {
Thanks,
Yang
>
> @Miaohe and @Jane, please let me know if my above reasoning makes sense or not.
>
> To make this patch right, folio's has_hwpoisoned flag needs to be preserved
> like what Yang described above. My current plan is to move
> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
> scan every page in the folio if the folio's has_hwpoisoned is set.
> There will be redundant scans in non uniform split case, since a has_hwpoisoned
> folio can be split multiple times (leading to multiple page scans), unless
> the scan result is stored.
>
> @Miaohe and @Jane, is it possible to have multiple HW poisoned pages in
> a folio? Is the memory failure process like 1) page access causing MCE,
> 2) memory_failure() is used to handle it and split the large folio containing
> it? Or multiple MCEs can be received and multiple pages in a folio are marked
> then a split would happen?
>
> >
> > Yang
> >
> >
> >> + /* get folio again in case the original one is split */
> >> + folio = page_folio(p);
> >> res = -EHWPOISON;
> >> kill_procs_now(p, pfn, flags, folio);
> >> put_page(p);
> >> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
> >> };
> >>
> >> if (!huge && folio_test_large(folio)) {
> >> - if (try_to_split_thp_page(page, true)) {
> >> + int new_order = min_order_for_split(folio);
> >> +
> >> + /*
> >> + * If the folio cannot be split to order-0, do not split it at
> >> + * all to retain the still accessible large folio.
> >> + * NOTE: if getting free memory is perferred, split it like it
> >> + * is done in memory_failure().
> >> + */
> >> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
> >> pr_info("%#lx: thp split failed\n", pfn);
> >> return -EBUSY;
> >> }
> >> --
> >> 2.51.0
> >>
> >>
>
>
> --
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-20 23:41 ` Yang Shi
@ 2025-10-21 1:23 ` Zi Yan
2025-10-21 15:44 ` David Hildenbrand
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-21 1:23 UTC (permalink / raw)
To: Yang Shi
Cc: linmiaohe, jane.chu, david, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), Wei Yang,
linux-fsdevel, linux-kernel, linux-mm
On 20 Oct 2025, at 19:41, Yang Shi wrote:
> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>>
>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>>> min_order_for_folio(). Current split fails directly, but that is not
>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>>> only the folio containing the poisoned page becomes unusable instead.
>>>>
>>>> For soft offline, do not split the large folio if it cannot be split to
>>>> order-0. Since the folio is still accessible from userspace and premature
>>>> split might lead to potential performance loss.
>>>>
>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>>> ---
>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index f698df156bf8..443df9581c24 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>>> * there is still more to do, hence the page refcount we took earlier
>>>> * is still needed.
>>>> */
>>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>>> + bool release)
>>>> {
>>>> int ret;
>>>>
>>>> lock_page(page);
>>>> - ret = split_huge_page(page);
>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>>> unlock_page(page);
>>>>
>>>> if (ret && release)
>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>> folio_unlock(folio);
>>>>
>>>> if (folio_test_large(folio)) {
>>>> + int new_order = min_order_for_split(folio);
>>>> /*
>>>> * The flag must be set after the refcount is bumped
>>>> * otherwise it may race with THP split.
>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>>> * page is a valid handlable page.
>>>> */
>>>> folio_set_has_hwpoisoned(folio);
>>>> - if (try_to_split_thp_page(p, false) < 0) {
>>>> + /*
>>>> + * If the folio cannot be split to order-0, kill the process,
>>>> + * but split the folio anyway to minimize the amount of unusable
>>>> + * pages.
>>>> + */
>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>
>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>>> to order-0 folios because the PG_hwpoisoned flag is set on the
>>> poisoned page. But if you split the folio to some smaller order large
>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>>> poisoned folio.
>>
>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
>> checked to be able to set after-split folio's flag properly. Current folio
>> split code does not do that. I am thinking about whether that causes any
>> issue. Probably not, because:
>>
>> 1. before Patch 1 is applied, large after-split folios are already causing
>> a warning in memory_failure(). That kinda masks this issue.
>> 2. after Patch 1 is applied, no large after-split folios will appear,
>> since the split will fail.
>
> I'm a little bit confused. Didn't this patch split large folio to
> new-order-large-folio (new order is min order)? So this patch had
> code:
> if (try_to_split_thp_page(p, new_order, false) || new_order) {
Yes, but this is Patch 2 in this series. Patch 1 is
"mm/huge_memory: do not change split_huge_page*() target order silently."
and sent separately as a hotfix[1].
Patch 2 and 3 in this series will be sent later when 1) Patch 1 is merged,
and 2) a prerequisite patch to address the issue you mentioned above is added
long with them.
[1] https://lore.kernel.org/linux-mm/20251017013630.139907-1-ziy@nvidia.com/
>
> Thanks,
> Yang
>
>>
>> @Miaohe and @Jane, please let me know if my above reasoning makes sense or not.
>>
>> To make this patch right, folio's has_hwpoisoned flag needs to be preserved
>> like what Yang described above. My current plan is to move
>> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
>> scan every page in the folio if the folio's has_hwpoisoned is set.
>> There will be redundant scans in non uniform split case, since a has_hwpoisoned
>> folio can be split multiple times (leading to multiple page scans), unless
>> the scan result is stored.
>>
>> @Miaohe and @Jane, is it possible to have multiple HW poisoned pages in
>> a folio? Is the memory failure process like 1) page access causing MCE,
>> 2) memory_failure() is used to handle it and split the large folio containing
>> it? Or multiple MCEs can be received and multiple pages in a folio are marked
>> then a split would happen?
>>
>>>
>>> Yang
>>>
>>>
>>>> + /* get folio again in case the original one is split */
>>>> + folio = page_folio(p);
>>>> res = -EHWPOISON;
>>>> kill_procs_now(p, pfn, flags, folio);
>>>> put_page(p);
>>>> @@ -2621,7 +2630,15 @@ static int soft_offline_in_use_page(struct page *page)
>>>> };
>>>>
>>>> if (!huge && folio_test_large(folio)) {
>>>> - if (try_to_split_thp_page(page, true)) {
>>>> + int new_order = min_order_for_split(folio);
>>>> +
>>>> + /*
>>>> + * If the folio cannot be split to order-0, do not split it at
>>>> + * all to retain the still accessible large folio.
>>>> + * NOTE: if getting free memory is perferred, split it like it
>>>> + * is done in memory_failure().
>>>> + */
>>>> + if (new_order || try_to_split_thp_page(page, new_order, true)) {
>>>> pr_info("%#lx: thp split failed\n", pfn);
>>>> return -EBUSY;
>>>> }
>>>> --
>>>> 2.51.0
>>>>
>>>>
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-21 1:23 ` Zi Yan
@ 2025-10-21 15:44 ` David Hildenbrand
2025-10-21 15:55 ` Zi Yan
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-10-21 15:44 UTC (permalink / raw)
To: Zi Yan, Yang Shi
Cc: linmiaohe, jane.chu, kernel, syzbot+e6367ea2fdab6ed46056,
syzkaller-bugs, akpm, mcgrof, nao.horiguchi, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Matthew Wilcox (Oracle), Wei Yang,
linux-fsdevel, linux-kernel, linux-mm
On 21.10.25 03:23, Zi Yan wrote:
> On 20 Oct 2025, at 19:41, Yang Shi wrote:
>
>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>>>
>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>
>>>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>>>> min_order_for_folio(). Current split fails directly, but that is not
>>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>>>> only the folio containing the poisoned page becomes unusable instead.
>>>>>
>>>>> For soft offline, do not split the large folio if it cannot be split to
>>>>> order-0. Since the folio is still accessible from userspace and premature
>>>>> split might lead to potential performance loss.
>>>>>
>>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>> ---
>>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>> index f698df156bf8..443df9581c24 100644
>>>>> --- a/mm/memory-failure.c
>>>>> +++ b/mm/memory-failure.c
>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>>>> * there is still more to do, hence the page refcount we took earlier
>>>>> * is still needed.
>>>>> */
>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>>>> + bool release)
>>>>> {
>>>>> int ret;
>>>>>
>>>>> lock_page(page);
>>>>> - ret = split_huge_page(page);
>>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>>>> unlock_page(page);
>>>>>
>>>>> if (ret && release)
>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>> folio_unlock(folio);
>>>>>
>>>>> if (folio_test_large(folio)) {
>>>>> + int new_order = min_order_for_split(folio);
>>>>> /*
>>>>> * The flag must be set after the refcount is bumped
>>>>> * otherwise it may race with THP split.
>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>>>> * page is a valid handlable page.
>>>>> */
>>>>> folio_set_has_hwpoisoned(folio);
>>>>> - if (try_to_split_thp_page(p, false) < 0) {
>>>>> + /*
>>>>> + * If the folio cannot be split to order-0, kill the process,
>>>>> + * but split the folio anyway to minimize the amount of unusable
>>>>> + * pages.
>>>>> + */
>>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>>
>>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>>>> to order-0 folios because the PG_hwpoisoned flag is set on the
>>>> poisoned page. But if you split the folio to some smaller order large
>>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>>>> poisoned folio.
>>>
>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
>>> checked to be able to set after-split folio's flag properly. Current folio
>>> split code does not do that. I am thinking about whether that causes any
>>> issue. Probably not, because:
>>>
>>> 1. before Patch 1 is applied, large after-split folios are already causing
>>> a warning in memory_failure(). That kinda masks this issue.
>>> 2. after Patch 1 is applied, no large after-split folios will appear,
>>> since the split will fail.
>>
>> I'm a little bit confused. Didn't this patch split large folio to
>> new-order-large-folio (new order is min order)? So this patch had
>> code:
>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
>
> Yes, but this is Patch 2 in this series. Patch 1 is
> "mm/huge_memory: do not change split_huge_page*() target order silently."
> and sent separately as a hotfix[1].
I'm confused now as well. I'd like to review, will there be a v3 that
only contains patch #2+#3?
Thanks!
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-21 15:44 ` David Hildenbrand
@ 2025-10-21 15:55 ` Zi Yan
2025-10-21 18:28 ` David Hildenbrand
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-21 15:55 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yang Shi, linmiaohe, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel, linux-kernel,
linux-mm
On 21 Oct 2025, at 11:44, David Hildenbrand wrote:
> On 21.10.25 03:23, Zi Yan wrote:
>> On 20 Oct 2025, at 19:41, Yang Shi wrote:
>>
>>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>>>>
>>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>>
>>>>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>>>>> min_order_for_folio(). Current split fails directly, but that is not
>>>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>>>>> only the folio containing the poisoned page becomes unusable instead.
>>>>>>
>>>>>> For soft offline, do not split the large folio if it cannot be split to
>>>>>> order-0. Since the folio is still accessible from userspace and premature
>>>>>> split might lead to potential performance loss.
>>>>>>
>>>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>>> ---
>>>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>> index f698df156bf8..443df9581c24 100644
>>>>>> --- a/mm/memory-failure.c
>>>>>> +++ b/mm/memory-failure.c
>>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>>>>> * there is still more to do, hence the page refcount we took earlier
>>>>>> * is still needed.
>>>>>> */
>>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>>>>> + bool release)
>>>>>> {
>>>>>> int ret;
>>>>>>
>>>>>> lock_page(page);
>>>>>> - ret = split_huge_page(page);
>>>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>>>>> unlock_page(page);
>>>>>>
>>>>>> if (ret && release)
>>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>> folio_unlock(folio);
>>>>>>
>>>>>> if (folio_test_large(folio)) {
>>>>>> + int new_order = min_order_for_split(folio);
>>>>>> /*
>>>>>> * The flag must be set after the refcount is bumped
>>>>>> * otherwise it may race with THP split.
>>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>> * page is a valid handlable page.
>>>>>> */
>>>>>> folio_set_has_hwpoisoned(folio);
>>>>>> - if (try_to_split_thp_page(p, false) < 0) {
>>>>>> + /*
>>>>>> + * If the folio cannot be split to order-0, kill the process,
>>>>>> + * but split the folio anyway to minimize the amount of unusable
>>>>>> + * pages.
>>>>>> + */
>>>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>>>
>>>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>>>>> to order-0 folios because the PG_hwpoisoned flag is set on the
>>>>> poisoned page. But if you split the folio to some smaller order large
>>>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>>>>> poisoned folio.
>>>>
>>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
>>>> checked to be able to set after-split folio's flag properly. Current folio
>>>> split code does not do that. I am thinking about whether that causes any
>>>> issue. Probably not, because:
>>>>
>>>> 1. before Patch 1 is applied, large after-split folios are already causing
>>>> a warning in memory_failure(). That kinda masks this issue.
>>>> 2. after Patch 1 is applied, no large after-split folios will appear,
>>>> since the split will fail.
>>>
>>> I'm a little bit confused. Didn't this patch split large folio to
>>> new-order-large-folio (new order is min order)? So this patch had
>>> code:
>>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>
>> Yes, but this is Patch 2 in this series. Patch 1 is
>> "mm/huge_memory: do not change split_huge_page*() target order silently."
>> and sent separately as a hotfix[1].
>
> I'm confused now as well. I'd like to review, will there be a v3 that only contains patch #2+#3?
Yes. The new V3 will have 3 patches:
1. a new patch addresses Yang’s concern on setting has_hwpoisoned on after-split
large folios.
2. patch#2,
3. patch#3.
The plan is to send them out once patch 1 is upstreamed. Let me know if you think
it is OK to send them out earlier as Andrew already picked up patch 1.
I also would like to get some feedback on my approach to setting has_hwpoisoned:
folio's has_hwpoisoned flag needs to be preserved
like what Yang described above. My current plan is to move
folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
scan every page in the folio if the folio's has_hwpoisoned is set.
There will be redundant scans in non uniform split case, since a has_hwpoisoned
folio can be split multiple times (leading to multiple page scans), unless
the scan result is stored.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-21 15:55 ` Zi Yan
@ 2025-10-21 18:28 ` David Hildenbrand
2025-10-21 18:57 ` Zi Yan
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-10-21 18:28 UTC (permalink / raw)
To: Zi Yan
Cc: Yang Shi, linmiaohe, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel, linux-kernel,
linux-mm
On 21.10.25 17:55, Zi Yan wrote:
> On 21 Oct 2025, at 11:44, David Hildenbrand wrote:
>
>> On 21.10.25 03:23, Zi Yan wrote:
>>> On 20 Oct 2025, at 19:41, Yang Shi wrote:
>>>
>>>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>
>>>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>>>>>
>>>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>>>
>>>>>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>>>>>> min_order_for_folio(). Current split fails directly, but that is not
>>>>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>>>>>> only the folio containing the poisoned page becomes unusable instead.
>>>>>>>
>>>>>>> For soft offline, do not split the large folio if it cannot be split to
>>>>>>> order-0. Since the folio is still accessible from userspace and premature
>>>>>>> split might lead to potential performance loss.
>>>>>>>
>>>>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>>>> ---
>>>>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>> index f698df156bf8..443df9581c24 100644
>>>>>>> --- a/mm/memory-failure.c
>>>>>>> +++ b/mm/memory-failure.c
>>>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>>>>>> * there is still more to do, hence the page refcount we took earlier
>>>>>>> * is still needed.
>>>>>>> */
>>>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>>>>>> + bool release)
>>>>>>> {
>>>>>>> int ret;
>>>>>>>
>>>>>>> lock_page(page);
>>>>>>> - ret = split_huge_page(page);
>>>>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>>>>>> unlock_page(page);
>>>>>>>
>>>>>>> if (ret && release)
>>>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>> folio_unlock(folio);
>>>>>>>
>>>>>>> if (folio_test_large(folio)) {
>>>>>>> + int new_order = min_order_for_split(folio);
>>>>>>> /*
>>>>>>> * The flag must be set after the refcount is bumped
>>>>>>> * otherwise it may race with THP split.
>>>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>> * page is a valid handlable page.
>>>>>>> */
>>>>>>> folio_set_has_hwpoisoned(folio);
>>>>>>> - if (try_to_split_thp_page(p, false) < 0) {
>>>>>>> + /*
>>>>>>> + * If the folio cannot be split to order-0, kill the process,
>>>>>>> + * but split the folio anyway to minimize the amount of unusable
>>>>>>> + * pages.
>>>>>>> + */
>>>>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>>>>
>>>>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>>>>>> to order-0 folios because the PG_hwpoisoned flag is set on the
>>>>>> poisoned page. But if you split the folio to some smaller order large
>>>>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>>>>>> poisoned folio.
>>>>>
>>>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
>>>>> checked to be able to set after-split folio's flag properly. Current folio
>>>>> split code does not do that. I am thinking about whether that causes any
>>>>> issue. Probably not, because:
>>>>>
>>>>> 1. before Patch 1 is applied, large after-split folios are already causing
>>>>> a warning in memory_failure(). That kinda masks this issue.
>>>>> 2. after Patch 1 is applied, no large after-split folios will appear,
>>>>> since the split will fail.
>>>>
>>>> I'm a little bit confused. Didn't this patch split large folio to
>>>> new-order-large-folio (new order is min order)? So this patch had
>>>> code:
>>>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>
>>> Yes, but this is Patch 2 in this series. Patch 1 is
>>> "mm/huge_memory: do not change split_huge_page*() target order silently."
>>> and sent separately as a hotfix[1].
>>
>> I'm confused now as well. I'd like to review, will there be a v3 that only contains patch #2+#3?
>
> Yes. The new V3 will have 3 patches:
> 1. a new patch addresses Yang’s concern on setting has_hwpoisoned on after-split
> large folios.
> 2. patch#2,
> 3. patch#3.
Okay, I'll wait with the review until you resend :)
>
> The plan is to send them out once patch 1 is upstreamed. Let me know if you think
> it is OK to send them out earlier as Andrew already picked up patch 1.
It's in mm/mm-new + mm/mm-unstable, AFAIKT. So sure, send it against one
of the tress (I prefer mm-unstable but usually we should target mm-new).
>
> I also would like to get some feedback on my approach to setting has_hwpoisoned:
>
> folio's has_hwpoisoned flag needs to be preserved
> like what Yang described above. My current plan is to move
> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
> scan every page in the folio if the folio's has_hwpoisoned is set.
Oh, that's nasty indeed ... will have to think about that a bit.
Maybe we can keep it simple and always set folio_set_has_hwpoisoned() on
all split folios? Essentially turning it into a "maybe_has" semantics.
IIUC, the existing folio_stest_has_hwpoisoned users can deal with that?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-21 18:28 ` David Hildenbrand
@ 2025-10-21 18:57 ` Zi Yan
2025-10-21 19:07 ` Yang Shi
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2025-10-21 18:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yang Shi, linmiaohe, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel, linux-kernel,
linux-mm
On 21 Oct 2025, at 14:28, David Hildenbrand wrote:
> On 21.10.25 17:55, Zi Yan wrote:
>> On 21 Oct 2025, at 11:44, David Hildenbrand wrote:
>>
>>> On 21.10.25 03:23, Zi Yan wrote:
>>>> On 20 Oct 2025, at 19:41, Yang Shi wrote:
>>>>
>>>>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>>
>>>>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>>>>>>
>>>>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>>>>
>>>>>>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>>>>>>> min_order_for_folio(). Current split fails directly, but that is not
>>>>>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>>>>>>> only the folio containing the poisoned page becomes unusable instead.
>>>>>>>>
>>>>>>>> For soft offline, do not split the large folio if it cannot be split to
>>>>>>>> order-0. Since the folio is still accessible from userspace and premature
>>>>>>>> split might lead to potential performance loss.
>>>>>>>>
>>>>>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>>>>> ---
>>>>>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>>>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>>> index f698df156bf8..443df9581c24 100644
>>>>>>>> --- a/mm/memory-failure.c
>>>>>>>> +++ b/mm/memory-failure.c
>>>>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>>>>>>> * there is still more to do, hence the page refcount we took earlier
>>>>>>>> * is still needed.
>>>>>>>> */
>>>>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>>>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>>>>>>> + bool release)
>>>>>>>> {
>>>>>>>> int ret;
>>>>>>>>
>>>>>>>> lock_page(page);
>>>>>>>> - ret = split_huge_page(page);
>>>>>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>>>>>>> unlock_page(page);
>>>>>>>>
>>>>>>>> if (ret && release)
>>>>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>>> folio_unlock(folio);
>>>>>>>>
>>>>>>>> if (folio_test_large(folio)) {
>>>>>>>> + int new_order = min_order_for_split(folio);
>>>>>>>> /*
>>>>>>>> * The flag must be set after the refcount is bumped
>>>>>>>> * otherwise it may race with THP split.
>>>>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>>> * page is a valid handlable page.
>>>>>>>> */
>>>>>>>> folio_set_has_hwpoisoned(folio);
>>>>>>>> - if (try_to_split_thp_page(p, false) < 0) {
>>>>>>>> + /*
>>>>>>>> + * If the folio cannot be split to order-0, kill the process,
>>>>>>>> + * but split the folio anyway to minimize the amount of unusable
>>>>>>>> + * pages.
>>>>>>>> + */
>>>>>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>>>>>
>>>>>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>>>>>>> to order-0 folios because the PG_hwpoisoned flag is set on the
>>>>>>> poisoned page. But if you split the folio to some smaller order large
>>>>>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>>>>>>> poisoned folio.
>>>>>>
>>>>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
>>>>>> checked to be able to set after-split folio's flag properly. Current folio
>>>>>> split code does not do that. I am thinking about whether that causes any
>>>>>> issue. Probably not, because:
>>>>>>
>>>>>> 1. before Patch 1 is applied, large after-split folios are already causing
>>>>>> a warning in memory_failure(). That kinda masks this issue.
>>>>>> 2. after Patch 1 is applied, no large after-split folios will appear,
>>>>>> since the split will fail.
>>>>>
>>>>> I'm a little bit confused. Didn't this patch split large folio to
>>>>> new-order-large-folio (new order is min order)? So this patch had
>>>>> code:
>>>>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>>>
>>>> Yes, but this is Patch 2 in this series. Patch 1 is
>>>> "mm/huge_memory: do not change split_huge_page*() target order silently."
>>>> and sent separately as a hotfix[1].
>>>
>>> I'm confused now as well. I'd like to review, will there be a v3 that only contains patch #2+#3?
>>
>> Yes. The new V3 will have 3 patches:
>> 1. a new patch addresses Yang’s concern on setting has_hwpoisoned on after-split
>> large folios.
>> 2. patch#2,
>> 3. patch#3.
>
> Okay, I'll wait with the review until you resend :)
>
>>
>> The plan is to send them out once patch 1 is upstreamed. Let me know if you think
>> it is OK to send them out earlier as Andrew already picked up patch 1.
>
> It's in mm/mm-new + mm/mm-unstable, AFAIKT. So sure, send it against one of the tress (I prefer mm-unstable but usually we should target mm-new).
Sure.
>
>>
>> I also would like to get some feedback on my approach to setting has_hwpoisoned:
>>
>> folio's has_hwpoisoned flag needs to be preserved
>> like what Yang described above. My current plan is to move
>> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
>> scan every page in the folio if the folio's has_hwpoisoned is set.
>
> Oh, that's nasty indeed ... will have to think about that a bit.
>
> Maybe we can keep it simple and always set folio_set_has_hwpoisoned() on all split folios? Essentially turning it into a "maybe_has" semantics.
>
> IIUC, the existing folio_stest_has_hwpoisoned users can deal with that?
folio_test_has_hwpoisoned() direct users are fine. They are shmem.c
and memory.c, where the former would copy data in PAGE_SIZE instead of folio size
and the latter would not install PMD entry for the folio (impossible to hit
this until we have > PMD mTHPs and split them to PMD THPs).
The caller of folio_contain_hwpoisoned_page(), which calls
folio_test_has_hwpoisoned(), would have issues:
1. shmem_write_begin() in shmem.c: it returns -EIO for shmem writes.
2. thp_underused() in huge_memory.c: it does not scan the folio.
3. shrink_folio_list() in vmscan.c: it does not reclaim large hwpoisoned folios.
4. do_migrate_range() in memory_hotplug.c: it skips the large hwpoisoned folios.
These behaviors are fine for folios truly containing hwpoisoned pages,
but might not be desirable for false positive cases. A scan to make sure
hwpoisoned pages are indeed present is inevitable. Rather than making
all callers to do the scan, scanning at split time might be better, IMHO.
Let me send a patchset with scanning at split time. Hopefully, more people
can chime in to provide feedbacks.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-21 18:57 ` Zi Yan
@ 2025-10-21 19:07 ` Yang Shi
0 siblings, 0 replies; 27+ messages in thread
From: Yang Shi @ 2025-10-21 19:07 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, linmiaohe, jane.chu, kernel,
syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm, mcgrof,
nao.horiguchi, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel, linux-kernel,
linux-mm
On Tue, Oct 21, 2025 at 11:58 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 21 Oct 2025, at 14:28, David Hildenbrand wrote:
>
> > On 21.10.25 17:55, Zi Yan wrote:
> >> On 21 Oct 2025, at 11:44, David Hildenbrand wrote:
> >>
> >>> On 21.10.25 03:23, Zi Yan wrote:
> >>>> On 20 Oct 2025, at 19:41, Yang Shi wrote:
> >>>>
> >>>>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>>>>
> >>>>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
> >>>>>>
> >>>>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>>>>>>
> >>>>>>>> Large block size (LBS) folios cannot be split to order-0 folios but
> >>>>>>>> min_order_for_folio(). Current split fails directly, but that is not
> >>>>>>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
> >>>>>>>> only the folio containing the poisoned page becomes unusable instead.
> >>>>>>>>
> >>>>>>>> For soft offline, do not split the large folio if it cannot be split to
> >>>>>>>> order-0. Since the folio is still accessible from userspace and premature
> >>>>>>>> split might lead to potential performance loss.
> >>>>>>>>
> >>>>>>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
> >>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> >>>>>>>> ---
> >>>>>>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
> >>>>>>>> 1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>>>>> index f698df156bf8..443df9581c24 100644
> >>>>>>>> --- a/mm/memory-failure.c
> >>>>>>>> +++ b/mm/memory-failure.c
> >>>>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> >>>>>>>> * there is still more to do, hence the page refcount we took earlier
> >>>>>>>> * is still needed.
> >>>>>>>> */
> >>>>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
> >>>>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> >>>>>>>> + bool release)
> >>>>>>>> {
> >>>>>>>> int ret;
> >>>>>>>>
> >>>>>>>> lock_page(page);
> >>>>>>>> - ret = split_huge_page(page);
> >>>>>>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
> >>>>>>>> unlock_page(page);
> >>>>>>>>
> >>>>>>>> if (ret && release)
> >>>>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>>> folio_unlock(folio);
> >>>>>>>>
> >>>>>>>> if (folio_test_large(folio)) {
> >>>>>>>> + int new_order = min_order_for_split(folio);
> >>>>>>>> /*
> >>>>>>>> * The flag must be set after the refcount is bumped
> >>>>>>>> * otherwise it may race with THP split.
> >>>>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>>> * page is a valid handlable page.
> >>>>>>>> */
> >>>>>>>> folio_set_has_hwpoisoned(folio);
> >>>>>>>> - if (try_to_split_thp_page(p, false) < 0) {
> >>>>>>>> + /*
> >>>>>>>> + * If the folio cannot be split to order-0, kill the process,
> >>>>>>>> + * but split the folio anyway to minimize the amount of unusable
> >>>>>>>> + * pages.
> >>>>>>>> + */
> >>>>>>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
> >>>>>>>
> >>>>>>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
> >>>>>>> to order-0 folios because the PG_hwpoisoned flag is set on the
> >>>>>>> poisoned page. But if you split the folio to some smaller order large
> >>>>>>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
> >>>>>>> poisoned folio.
> >>>>>>
> >>>>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
> >>>>>> checked to be able to set after-split folio's flag properly. Current folio
> >>>>>> split code does not do that. I am thinking about whether that causes any
> >>>>>> issue. Probably not, because:
> >>>>>>
> >>>>>> 1. before Patch 1 is applied, large after-split folios are already causing
> >>>>>> a warning in memory_failure(). That kinda masks this issue.
> >>>>>> 2. after Patch 1 is applied, no large after-split folios will appear,
> >>>>>> since the split will fail.
> >>>>>
> >>>>> I'm a little bit confused. Didn't this patch split large folio to
> >>>>> new-order-large-folio (new order is min order)? So this patch had
> >>>>> code:
> >>>>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
> >>>>
> >>>> Yes, but this is Patch 2 in this series. Patch 1 is
> >>>> "mm/huge_memory: do not change split_huge_page*() target order silently."
> >>>> and sent separately as a hotfix[1].
> >>>
> >>> I'm confused now as well. I'd like to review, will there be a v3 that only contains patch #2+#3?
> >>
> >> Yes. The new V3 will have 3 patches:
> >> 1. a new patch addresses Yang’s concern on setting has_hwpoisoned on after-split
> >> large folios.
> >> 2. patch#2,
> >> 3. patch#3.
> >
> > Okay, I'll wait with the review until you resend :)
> >
> >>
> >> The plan is to send them out once patch 1 is upstreamed. Let me know if you think
> >> it is OK to send them out earlier as Andrew already picked up patch 1.
> >
> > It's in mm/mm-new + mm/mm-unstable, AFAIKT. So sure, send it against one of the tress (I prefer mm-unstable but usually we should target mm-new).
>
> Sure.
> >
> >>
> >> I also would like to get some feedback on my approach to setting has_hwpoisoned:
> >>
> >> folio's has_hwpoisoned flag needs to be preserved
> >> like what Yang described above. My current plan is to move
> >> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
> >> scan every page in the folio if the folio's has_hwpoisoned is set.
> >
> > Oh, that's nasty indeed ... will have to think about that a bit.
> >
> > Maybe we can keep it simple and always set folio_set_has_hwpoisoned() on all split folios? Essentially turning it into a "maybe_has" semantics.
> >
> > IIUC, the existing folio_stest_has_hwpoisoned users can deal with that?
>
> folio_test_has_hwpoisoned() direct users are fine. They are shmem.c
> and memory.c, where the former would copy data in PAGE_SIZE instead of folio size
> and the latter would not install PMD entry for the folio (impossible to hit
> this until we have > PMD mTHPs and split them to PMD THPs).
>
> The caller of folio_contain_hwpoisoned_page(), which calls
> folio_test_has_hwpoisoned(), would have issues:
>
> 1. shmem_write_begin() in shmem.c: it returns -EIO for shmem writes.
> 2. thp_underused() in huge_memory.c: it does not scan the folio.
> 3. shrink_folio_list() in vmscan.c: it does not reclaim large hwpoisoned folios.
> 4. do_migrate_range() in memory_hotplug.c: it skips the large hwpoisoned folios.
>
> These behaviors are fine for folios truly containing hwpoisoned pages,
> but might not be desirable for false positive cases. A scan to make sure
> hwpoisoned pages are indeed present is inevitable. Rather than making
> all callers to do the scan, scanning at split time might be better, IMHO.
Yeah, I was trying to figure out a simpler way too. For example, we
can defer to set this flag to page fault time when page fault sees the
poisoned page when installing PTEs. But it can't cover most of the
cases mentioned by Zi Yan above. We may run into them before any page
fault happens.
Thanks,
Yang
>
> Let me send a patchset with scanning at split time. Hopefully, more people
> can chime in to provide feedbacks.
>
>
> --
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.
2025-10-20 19:46 ` Zi Yan
2025-10-20 23:41 ` Yang Shi
@ 2025-10-22 6:39 ` Miaohe Lin
1 sibling, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2025-10-22 6:39 UTC (permalink / raw)
To: Zi Yan
Cc: david, kernel, syzbot+e6367ea2fdab6ed46056, syzkaller-bugs, akpm,
mcgrof, nao.horiguchi, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Matthew Wilcox (Oracle), Wei Yang, linux-fsdevel,
linux-kernel, linux-mm, Yang Shi, jane.chu
On 2025/10/21 3:46, Zi Yan wrote:
> On 17 Oct 2025, at 15:11, Yang Shi wrote:
>
>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> Large block size (LBS) folios cannot be split to order-0 folios but
>>> min_order_for_folio(). Current split fails directly, but that is not
>>> optimal. Split the folio to min_order_for_folio(), so that, after split,
>>> only the folio containing the poisoned page becomes unusable instead.
>>>
>>> For soft offline, do not split the large folio if it cannot be split to
>>> order-0. Since the folio is still accessible from userspace and premature
>>> split might lead to potential performance loss.
>>>
>>> Suggested-by: Jane Chu <jane.chu@oracle.com>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>>> ---
>>> mm/memory-failure.c | 25 +++++++++++++++++++++----
>>> 1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f698df156bf8..443df9581c24 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>> * there is still more to do, hence the page refcount we took earlier
>>> * is still needed.
>>> */
>>> -static int try_to_split_thp_page(struct page *page, bool release)
>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
>>> + bool release)
>>> {
>>> int ret;
>>>
>>> lock_page(page);
>>> - ret = split_huge_page(page);
>>> + ret = split_huge_page_to_list_to_order(page, NULL, new_order);
>>> unlock_page(page);
>>>
>>> if (ret && release)
>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
>>> folio_unlock(folio);
>>>
>>> if (folio_test_large(folio)) {
>>> + int new_order = min_order_for_split(folio);
>>> /*
>>> * The flag must be set after the refcount is bumped
>>> * otherwise it may race with THP split.
>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
>>> * page is a valid handlable page.
>>> */
>>> folio_set_has_hwpoisoned(folio);
>>> - if (try_to_split_thp_page(p, false) < 0) {
>>> + /*
>>> + * If the folio cannot be split to order-0, kill the process,
>>> + * but split the folio anyway to minimize the amount of unusable
>>> + * pages.
>>> + */
>>> + if (try_to_split_thp_page(p, new_order, false) || new_order) {
>>
>> folio split will clear PG_has_hwpoisoned flag. It is ok for splitting
>> to order-0 folios because the PG_hwpoisoned flag is set on the
>> poisoned page. But if you split the folio to some smaller order large
>> folios, it seems you need to keep PG_has_hwpoisoned flag on the
>> poisoned folio.
>
> OK, this means all pages in a folio with folio_test_has_hwpoisoned() should be
> checked to be able to set after-split folio's flag properly. Current folio
> split code does not do that. I am thinking about whether that causes any
> issue. Probably not, because:
>
> 1. before Patch 1 is applied, large after-split folios are already causing
> a warning in memory_failure(). That kinda masks this issue.
> 2. after Patch 1 is applied, no large after-split folios will appear,
> since the split will fail.
>
> @Miaohe and @Jane, please let me know if my above reasoning makes sense or not.
>
> To make this patch right, folio's has_hwpoisoned flag needs to be preserved
> like what Yang described above. My current plan is to move
> folio_clear_has_hwpoisoned(folio) into __split_folio_to_order() and
> scan every page in the folio if the folio's has_hwpoisoned is set.
> There will be redundant scans in non uniform split case, since a has_hwpoisoned
> folio can be split multiple times (leading to multiple page scans), unless
> the scan result is stored.
>
> @Miaohe and @Jane, is it possible to have multiple HW poisoned pages in
> a folio? Is the memory failure process like 1) page access causing MCE,
> 2) memory_failure() is used to handle it and split the large folio containing
> it? Or multiple MCEs can be received and multiple pages in a folio are marked
> then a split would happen?
memory_failure() is called with mf_mutex held. So I think even if multiple pages
in a folio trigger multiple MCEs at the same time, only one page will have HWPoison
flag set when splitting folio. If folio is successfully split, things look fine.
But if it fails to split folio due to e.g. extra refcnt held by others, the following
pages will see that there are already multiple pages in a folio are marked as HWPoison.
This is the scenario I can think of at the moment.
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2025-10-22 6:39 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-16 3:34 [PATCH v2 0/3] Do not change split folio target order Zi Yan
2025-10-16 3:34 ` [PATCH v2 1/3] mm/huge_memory: do not change split_huge_page*() target order silently Zi Yan
2025-10-16 7:31 ` Wei Yang
2025-10-16 14:32 ` Zi Yan
2025-10-16 20:59 ` Andrew Morton
2025-10-17 1:03 ` Zi Yan
2025-10-17 9:06 ` Lorenzo Stoakes
2025-10-17 9:10 ` Lorenzo Stoakes
2025-10-17 14:16 ` Zi Yan
2025-10-17 14:32 ` Lorenzo Stoakes
2025-10-18 0:05 ` Andrew Morton
2025-10-17 1:01 ` Wei Yang
2025-10-16 3:34 ` [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling Zi Yan
2025-10-17 9:33 ` Lorenzo Stoakes
2025-10-20 20:09 ` Zi Yan
2025-10-17 19:11 ` Yang Shi
2025-10-20 19:46 ` Zi Yan
2025-10-20 23:41 ` Yang Shi
2025-10-21 1:23 ` Zi Yan
2025-10-21 15:44 ` David Hildenbrand
2025-10-21 15:55 ` Zi Yan
2025-10-21 18:28 ` David Hildenbrand
2025-10-21 18:57 ` Zi Yan
2025-10-21 19:07 ` Yang Shi
2025-10-22 6:39 ` Miaohe Lin
2025-10-16 3:34 ` [PATCH v2 3/3] mm/huge_memory: fix kernel-doc comments for folio_split() and related Zi Yan
2025-10-17 9:20 ` Lorenzo Stoakes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).