* [RFC PATCH v1.2 04/11] selftests/damon/sysfs.sh: test multiple probe dirs creation
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
Cc: SeongJae Park, Shuah Khan, damon, linux-kernel, linux-kselftest,
linux-mm
In-Reply-To: <20260625142357.103500-1-sj@kernel.org>
The DAMON sysfs essential file operations test (sysfs.sh) was extended to
test the DAMON probes sysfs directory by commit 14885da09b0f
("selftests/damon/sysfs.sh: test probes dir"). Unlike the tests for other
DAMON sysfs directories, it covers only the single-directory case. Extend it
to cover multiple directories.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
tools/testing/selftests/damon/sysfs.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index 78f4badb5bebb..0f2ef462a6b6a 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -346,8 +346,13 @@ test_probes()
ensure_write_succ "$probes_dir/nr_probes" "1" "valid input"
test_probe "$probes_dir/0"
+ ensure_write_succ "$probes_dir/nr_probes" "2" "valid input"
+ test_probe "$probes_dir/0"
+ test_probe "$probes_dir/1"
+
ensure_write_succ "$probes_dir/nr_probes" "0" "valid input"
ensure_dir "$probes_dir/0" "not_exist"
+ ensure_dir "$probes_dir/1" "not_exist"
}
test_monitoring_attrs()
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.2 03/11] mm/damon/tests/core-kunit: test damon_rand()
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, Brendan Higgins, David Gow, damon,
kunit-dev, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <20260625142357.103500-1-sj@kernel.org>
Commit 9012c4e647df ("mm/damon: replace damon_rand() with a per-ctx
lockless PRNG") optimized DAMON for better performance. Add a kunit
test that checks the output of damon_rand() stays within the requested
bounds.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/damon/tests/core-kunit.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h
index 1cfb8c176b873..eec7cb325a431 100644
--- a/mm/damon/tests/core-kunit.h
+++ b/mm/damon/tests/core-kunit.h
@@ -1460,6 +1460,22 @@ static void damon_test_is_last_region(struct kunit *test)
damon_free_target(t);
}
+static void damon_test_rand(struct kunit *test)
+{
+ struct damon_ctx ctx;
+ int counts[10] = {};
+ int i;
+
+ prandom_seed_state(&ctx.rnd_state, get_random_u64());
+ for (i = 0; i < 10000; i++) {
+ unsigned long rnd = damon_rand(&ctx, 0, 10);
+
+ KUNIT_EXPECT_GE(test, rnd, 0);
+ KUNIT_EXPECT_LE(test, rnd, 9);
+ counts[rnd]++;
+ }
+}
+
static struct kunit_case damon_test_cases[] = {
KUNIT_CASE(damon_test_target),
KUNIT_CASE(damon_test_regions),
@@ -1489,6 +1505,7 @@ static struct kunit_case damon_test_cases[] = {
KUNIT_CASE(damon_test_set_filters_default_reject),
KUNIT_CASE(damon_test_apply_min_nr_regions),
KUNIT_CASE(damon_test_is_last_region),
+ KUNIT_CASE(damon_test_rand),
{},
};
--
2.47.3
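
As a rough userspace illustration of the property damon_test_rand() checks,
the sketch below maps a PRNG output into [l, r) and counts samples that fall
outside that range. The xorshift64 generator and the names sketch_rand() and
sketch_count_out_of_bounds() are hypothetical stand-ins chosen for this
example; this is not the DAMON per-ctx PRNG, only an assumption-laden model of
the bounds check.

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-context PRNG state. */
struct sketch_rnd_state {
	uint64_t s;
};

/* xorshift64: a simple lock-free generator; the state must be non-zero. */
static uint64_t sketch_next(struct sketch_rnd_state *st)
{
	uint64_t x = st->s;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	st->s = x;
	return x;
}

/* Return a value in [l, r); the caller must ensure l < r. */
static unsigned long sketch_rand(struct sketch_rnd_state *st,
		unsigned long l, unsigned long r)
{
	return l + (unsigned long)(sketch_next(st) % (r - l));
}

/* Draw nr samples from [l, r) and return how many fell outside it. */
static unsigned int sketch_count_out_of_bounds(uint64_t seed,
		unsigned long l, unsigned long r, unsigned int nr)
{
	/* A zero seed would make xorshift64 emit zeros forever. */
	struct sketch_rnd_state st = { .s = seed ? seed : 1 };
	unsigned int i, bad = 0;

	for (i = 0; i < nr; i++) {
		unsigned long rnd = sketch_rand(&st, l, r);

		if (rnd < l || rnd >= r)
			bad++;
	}
	return bad;
}
```

Calling sketch_count_out_of_bounds(seed, 0, 10, 10000) mirrors the loop in the
patch. The modulo mapping is slightly biased, which does not matter for a
bounds-only check.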
^ permalink raw reply related
* [RFC PATCH v1.2 02/11] Docs/ABI/damon: document probe files
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
Vlastimil Babka, damon, linux-kernel, linux-mm
In-Reply-To: <20260625142357.103500-1-sj@kernel.org>
The DAMON ABI document has not been updated for the DAMON probe sysfs files.
Update it.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
.../ABI/testing/sysfs-kernel-mm-damon | 40 +++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index b73e6bc28ea5f..f914aab79fced 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -157,6 +157,46 @@ Description: Writing a value to this file sets the maximum number of
monitoring regions of the DAMON context as the value. Reading
this file returns the value.
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/nr_probes
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+ directories for each DAMON probe named '0' to 'N-1' under the
+ probes/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/<P>/filters/nr_filters
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+ directories for each DAMON probe filter named '0' to 'N-1'
+ under the filters/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/<P>/filters/<F>/type
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the type of
+ the memory of the interest.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/<P>/filters/<F>/path
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: If 'memcg' is written to the 'type' file, writing to and
+ reading from this file sets and gets the path to the memory
+ cgroup of the interest.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/<P>/filters/<F>/matching
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing 'Y' or 'N' to this file sets whether the filter is for
+ the memory of the 'type', or all except the 'type'.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/probes/<P>/filters/<F>/allow
+Date: May 2026
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing 'Y' or 'N' to this file sets whether to allow or reject
+ hitting the probe for the memory that satisfies the 'type' and
+ the 'matching' of the directory.
+
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/nr_targets
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.2 01/11] Docs/mm/damon/design: update for DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260625142357.103500-1-sj@kernel.org>
Commit 9138e27a3bc3 ("mm/damon: add node_eligible_mem_bp goal metric")
introduced DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP but forgot to update the
DAMON design document for it. Update the document.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/mm/damon/design.rst | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 2da7ca0d3d17a..9dbace087a329 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -686,9 +686,11 @@ mechanism tries to make ``current_value`` of ``target_metric`` be same to
(1/10,000).
- ``inactive_mem_bp``: Inactive to active + inactive (LRU) memory size ratio in
bp (1/10,000).
+- ``node_eligible_mem_bp``: Scheme target access pattern-eligible memory ratio
+ of a node in bp (1/10,000).
-``nid`` is optionally required for only ``node_mem_used_bp``,
-``node_mem_free_bp``, ``node_memcg_used_bp`` and ``node_memcg_free_bp`` to
+``nid`` is optionally required for ``node_mem_used_bp``, ``node_mem_free_bp``,
+``node_memcg_used_bp``, ``node_memcg_free_bp`` and ``node_eligible_mem_bp`` to
point the specific NUMA node.
``path`` is optionally required for only ``node_memcg_used_bp`` and
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v1.2 00/11] mm/damon: update, optimize, and clean up doc, tests, and code
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, Brendan Higgins,
David Gow, David Hildenbrand, Jonathan Corbet, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, damon, kunit-dev, linux-doc,
linux-kernel, linux-kselftest, linux-mm
Patches 1 and 2 update the design and ABI documents for recently added
DAMON features. Patches 3-7 add or update more unit and self tests for
DAMON to cover recently changed or added functions and sysfs files.
Patch 8 optimizes damon_commit_target_regions() to skip unnecessary
adjacent ranges setup. Patches 9-11 clean and fix up recently added
DAMON sysfs interface code for readability.
Changes from RFC v1.1
- RFC v1.1: https://lore.kernel.org/20260625050756.91115-1-sj@kernel.org
- Document nid requirement for node_eligible_mem_bp.
- Fix typos: s/memmcg/memcg/, s/geets/gets/.
- Drop damon_rand() randomness test case; test bounds only.
- Fixup dests dir selftest to do real test with correct file permission checks.
Changes from RFC
- RFC: https://lore.kernel.org/20260624142008.87180-1-sj@kernel.org
- Rebase directly to latest mm-new.
SeongJae Park (11):
Docs/mm/damon/design: update for DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP
Docs/ABI/damon: document probe files
mm/damon/tests/core-kunit: test damon_rand()
selftests/damon/sysfs.sh: test multiple probe dirs creation
selftests/damon/sysfs.sh: test {core,ops}_filters/ directories
selftests/damon/sysfs.sh: test dests dir
selftests/damon/sysfs.sh: test all files in quota goal dir
mm/damon/core: reduce range setup in damon_commit_target_regions()
mm/damon/sysfs: split probe setup function out
mm/damon/sysfs: split out filters setup function
mm/damon/sysfs: fix typos in probe_{add,rm}_dirs: s/attr/probe/
.../ABI/testing/sysfs-kernel-mm-damon | 40 +++++++
Documentation/mm/damon/design.rst | 6 +-
mm/damon/core.c | 22 +++-
mm/damon/sysfs.c | 102 ++++++++++--------
mm/damon/tests/core-kunit.h | 17 +++
tools/testing/selftests/damon/sysfs.sh | 71 +++++++++++-
6 files changed, 205 insertions(+), 53 deletions(-)
base-commit: ada7832345164eed1bbca10543b0c46f13738215
--
2.47.3
^ permalink raw reply
* Re: [PATCH 6.18.y v4 0/9] mm: backport sticky VMA flags and soft-dirty fix
From: Lorenzo Stoakes @ 2026-06-25 14:16 UTC (permalink / raw)
To: Ahmed Elaidy; +Cc: stable, linux-mm, akpm, avagin
In-Reply-To: <20260515124218.151966-2-elaidya225@gmail.com>
On Fri, May 15, 2026 at 03:42:10PM +0300, Ahmed Elaidy wrote:
> This series backports the sticky VMA flags infrastructure and the
> VM_SOFTDIRTY-on-merge fix to linux-6.18.y.
Thanks again for doing this Ahmed! :)
Cheers, Lorenzo
>
> Motivation: CRIU incremental dump/restore can hit a missing-parent-pagemap
> failure when VM_SOFTDIRTY is lost during VMA merge operations.
>
> Patch 8 is the target fix:
> mm: propagate VM_SOFTDIRTY on merge
>
> The preceding patches provide required dependencies on 6.18.y and are included
> to preserve upstream behavior, as requested by maintainers for stable backports.
>
> Changes since v3:
> - Reverted to sending the full 9-patch series as requested by Greg KH and Lorenzo.
> - Updated Lorenzo's email to ljs@kernel.org across all patches.
> - Added Cc: stable@vger.kernel.org # 6.18.x to all patches.
> - Added Fixes tag for soft-dirty merging in Patch 8.
>
> Lorenzo Stoakes (9):
> mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
> mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
> mm: update vma_modify_flags() to handle residual flags, document
> mm: implement sticky VMA flags
> mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one
> mm: set the VM_MAYBE_GUARD flag on guard region install
> tools/testing/vma: add VMA sticky userland tests
> mm: propagate VM_SOFTDIRTY on merge
> testing/selftests/mm: add soft-dirty merge self-test
>
> Documentation/filesystems/proc.rst | 5 +-
> fs/proc/task_mmu.c | 1 +
> include/linux/mm.h | 100 +++++++++++++++++
> include/trace/events/mmflags.h | 1 +
> mm/khugepaged.c | 71 +++++++-----
> mm/madvise.c | 24 +++--
> mm/memory.c | 14 +--
> mm/mlock.c | 2 +-
> mm/mprotect.c | 2 +-
> mm/mseal.c | 7 +-
> mm/vma.c | 81 +++++++-------
> mm/vma.h | 138 +++++++++++++++++-------
> tools/testing/selftests/mm/soft-dirty.c | 127 +++++++++++++++++++++-
> tools/testing/vma/vma.c | 92 ++++++++++++++--
> tools/testing/vma/vma_internal.h | 49 +++++++++
> 15 files changed, 579 insertions(+), 135 deletions(-)
>
> --
> 2.54.0
>
^ permalink raw reply
* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: Lorenzo Stoakes @ 2026-06-25 14:02 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Wei Yang, akpm, riel, liam, vbabka, harry, jannh, willy, linux-mm,
linux-kernel, lance.yang
In-Reply-To: <c66f4bec-0933-401b-bf2f-a1b2e256023f@kernel.org>
On Thu, Jun 25, 2026 at 03:49:59PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 15:45, Lorenzo Stoakes wrote:
> > On Wed, Jun 24, 2026 at 08:23:59AM +0000, Wei Yang wrote:
> >> The kernel test robot reported a build failure on the parisc architecture
> >> when expanding HPAGE_PMD_NR in check_pmd().
> >
> > Let me first say that I absolutely hate that we continue to support museum
> > piece architectures to the point that we have to make changes in core code
> > to accommodate them.
>
> I wonder why we shouldn't be able to trigger that on other archs with
> !CONFIG_TRANSPARENT_HUGEPAGE ?
I think this should just use CONFIG_PGTABLE_HAS_HUGE_LEAVES, since that's the
property that literally defines whether check_pmd() makes any sense.
>
> I think the code just relies on pmd_trans_huge() == false, and consequently
> check_pmd will get compiled out completely.
>
> Now, the report was against Wei's new patch.
>
> There is *nothing* to be fixed for existing code.
OK so it's a fix sent in the merge window, against a patch sent in the merge
window. Great.
I mean, let's all chill here. Sip some wine. Some brandy. Some absinthe. Perhaps
even some turpentine* for the connoisseurs!
Slow down a bit Wei!
You're sending a lot of fiddly series that require a lot of review and we're
extremely busy with review already.
Please just relax and maybe go water your garden a bit. Have a cornetto :)
>
>
> Fixes: 2aff7a4755be ("mm: Convert page_vma_mapped_walk to work on PFNs")
>
> is just wrong?
Yes therefore it is indeed.
Though it's really horrible that we relied on things getting compiled out like
that... nasty!
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
*Obligatory safety notice for the overly literal: do not do this, this is a
joke.
^ permalink raw reply
* Re: [PATCH 0/5] Fix incorrect access of hugetlb pte entries
From: Zi Yan @ 2026-06-25 13:59 UTC (permalink / raw)
To: Dev Jain, muchun.song, osalvador, akpm, ljs, david, liam
Cc: riel, vbabka, harry, jannh, lance.yang, kas, linux-mm,
linux-kernel, rcampbell, apopple, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, mel, nao.horiguchi, ak,
j-nomura, pfalcato, dave.hansen, tglx, jpoimboe, ryan.roberts,
anshuman.khandual
In-Reply-To: <20260625112955.3254283-1-dev.jain@arm.com>
On Thu Jun 25, 2026 at 7:29 AM EDT, Dev Jain wrote:
> There are various places which use ptep_get() to get the pte entry
> corresponding to a hugetlb folio. Some arches have special handling
I think it is better to mention s390 as a concrete example.
> to compute the pteval, so they provide huge_ptep_get(). Use this
> helper consistently.
>
> Dev Jain (5):
> mm/rmap: use huge_ptep_get() in try_to_unmap_one()
> mm/rmap: use huge_ptep_get() in try_to_migrate_one()
> mm/migrate: use huge_ptep_get() in remove_migration_pte()
> mm/page_vma_mapped: use huge_ptep_get() for hugetlb
> mm/mprotect: use huge_ptep_get() for hugetlb
>
> include/linux/hugetlb.h | 3 +++
> mm/migrate.c | 6 +++++-
> mm/mprotect.c | 8 +++++++-
> mm/page_vma_mapped.c | 8 +++++++-
> mm/rmap.c | 32 ++++++++++++++++++++------------
> 5 files changed, 42 insertions(+), 15 deletions(-)
--
Best Regards,
Yan, Zi
^ permalink raw reply
* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: Lorenzo Stoakes @ 2026-06-25 13:51 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, david, riel, liam, vbabka, harry, jannh, willy,
linux-mm, linux-kernel, lance.yang, balbirs, Roman Gushchin
In-Reply-To: <20260625064102.tcmvrctcqibl54yr@master>
+cc Roman for Sashiko discussion
On Thu, Jun 25, 2026 at 06:41:02AM +0000, Wei Yang wrote:
> +cc Balbir
>
> On Wed, Jun 24, 2026 at 09:59:43PM -0700, Andrew Morton wrote:
> >On Thu, 25 Jun 2026 03:46:29 +0000 Wei Yang <richard.weiyang@gmail.com> wrote:
> >
> >> >Sashiko had an off-topic complaint about the surrounding code:
> >> > https://lore.kernel.org/oe-kbuild-all/202606240042.ffPsEXVc-lkp@intel.com/
> >>
> >> I see this robot reply, but do not see the Sashiko comment.
> >>
> >> How can I view Sashiko's comment?
> >
> >oop sorry.
> >
> >You can go to https://sashiko.dev/ and search for the email subject.
> >
> >Or append your Message-ID to "https://sashiko.dev/#/patchset":
> >
> > https://sashiko.dev/#/patchset/20260624082359.2869-1-richard.weiyang@gmail.com
> >
>
> Got it, thanks
>
> This one mentioned two things:
>
> a. page_vma_mapped_walk() return without check
> b. whether __split_huge_pmd_locked() would split device-private pmd
>
> For a., it is being fixed at [1].
>
> For b., to be honest I am not 100% sure. If a device-private pmd could be
> file backed, then this looks like a bug.
>
> Balbir,
>
> Would you mind taking a look at the second comment raised by Sashiko?
>
> [1]: https://lore.kernel.org/linux-mm/20260624065353.1622-1-richard.weiyang@gmail.com/
I continue to dislike that sashiko does this.
Series with... interesting use of AI :).. are already taking up more of the time
we reviewers don't have... but interrupting existing review to mention random
stuff is unhelpful I feel :)
I think the better use of time here would be for Balbir to perhaps ask AI to
examine all cases where a PMD device private entry might crop up and to check to
see if there's any other bugs similar to the ones we've encountered before?
Given Sashiko is very token-constrained, I also wonder whether this feature
wouldn't be better disabled (or maybe have the ability to turn off
per-subsystem?)
In a couple of other cases the 'also consider' stuff actually took a bunch of
time unnecessarily and I felt it interfered with the series landing.
Given the time constraints we all work under, it'd be better not to add to
workload this way (having to figure out if the points are valid is a time drain
in itself).
Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges()
From: Gregory Price @ 2026-06-25 13:51 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, nvdimm, linux-kernel, linux-cxl, driver-core,
linux-kselftest, kernel-team, osalvador, gregkh, rafael, dakr,
djbw, vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt,
surenb, mhocko, shuah, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
In-Reply-To: <d48feca1-0203-43ff-bd66-6243291a51ba@kernel.org>
On Thu, Jun 25, 2026 at 09:22:01AM +0200, David Hildenbrand (Arm) wrote:
> On 6/24/26 16:57, Gregory Price wrote:
> > extern int offline_and_remove_memory(u64 start, u64 size);
> > +int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges);
> >
> > #else
> > static inline void try_offline_node(int nid) {}
> > @@ -283,6 +284,12 @@ static inline int remove_memory(u64 start, u64 size)
> > }
> >
> > static inline void __remove_memory(u64 start, u64 size) {}
> > +
> > +static inline int offline_and_remove_memory_ranges(const struct range *ranges,
> > + int nr_ranges)
>
> Best to use "unsigned int" right from the start and use two tabs to indent.
>
ack, ack. need to reprogram my brain to two-indent style, i keep doing
this reflexively.
> > +int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges)
> > +{
> > + unsigned long mb_total = 0;
> > uint8_t *online_types, *tmp;
> > - int rc;
> > + int i, rc = 0;
> >
> > - if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
> > - !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
> > + if (!ranges || nr_ranges <= 0)
>
> With "unsigned int" this will be !nr_ranges.
>
> Wondering whether we would WARN_ON_ONCE() here.
>
Seems reasonable. Do we normally WARN when callers send dumb arguments?
Seems like sending -EINVAL is sufficient?
> > - online_types = kmalloc_array(mb_count, sizeof(*online_types),
> > + online_types = kmalloc_array(mb_total, sizeof(*online_types),
> > GFP_KERNEL);
>
> Is "mb_total" really more expressive than "mb_count"?
>
No, this was mostly my way of keeping track of what was being moved around
while working on it. I will change it back.
> > /*
> > - * In case we succeeded to offline all memory, remove it.
> > - * This cannot fail as it cannot get onlined in the meantime.
> > + * Phase 2: Remove each range. This essentially cannot fail as we hold
> > + * the hotplug lock. WARN if that assumption is ever broken.
> > */
> > if (!rc) {
> > - rc = try_remove_memory(start, size);
> > - if (rc)
> > - pr_err("%s: Failed to remove memory: %d", __func__, rc);
> > + for (i = 0; i < nr_ranges; i++) {
> > + rc = try_remove_memory(ranges[i].start,
> > + range_len(&ranges[i]));
> > + if (WARN_ON_ONCE(rc)) {
> > + pr_err("%s: Failed to remove memory: %d",
> > + __func__, rc);
> > + break;
>
> Do we really want to break? I'd say, just warn and continue, and fake rc == 0.
> Something is seriously messed up already, and we partially removed memory. There
> is no clean rollback possible.
>
> Similar to __remove_memory(), ignoring the error because it offlined it already.
>
This seems reasonable, will change to warn and continue + return error.
Sashiko actually pointed out that there's a corner condition here with
offline rollback, so i needed to tweak this chunk anyway.
~Gregory
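
A minimal userspace sketch of the "warn and continue + return error" shape
discussed above follows. It assumes an inclusive-end range like the kernel's
struct range; sketch_try_remove(), sketch_remove_ranges() and the
sketch_nr_removed counter are hypothetical names for illustration only, and
the stand-in removal hook simply rejects zero-length ranges. It is not the
actual offline_and_remove_memory_ranges() code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical inclusive range, modelled on the kernel's struct range. */
struct sketch_range {
	unsigned long long start;
	unsigned long long end;
};

/* Bookkeeping for this sketch only: ranges removed by the last call. */
static unsigned int sketch_nr_removed;

/* Stand-in removal hook: fails for zero-length ranges. */
static int sketch_try_remove(unsigned long long start, unsigned long long len)
{
	(void)start;
	if (!len)
		return -EINVAL;
	sketch_nr_removed++;
	return 0;
}

/*
 * Try to remove every range. A failure is reported and remembered, but the
 * loop keeps going because earlier removals cannot be rolled back.
 * Returns 0 if all removals succeeded, else the first error seen.
 */
static int sketch_remove_ranges(const struct sketch_range *ranges,
		unsigned int nr_ranges)
{
	unsigned int i;
	int rc, first_err = 0;

	if (!ranges || !nr_ranges)
		return -EINVAL;

	sketch_nr_removed = 0;
	for (i = 0; i < nr_ranges; i++) {
		rc = sketch_try_remove(ranges[i].start,
				ranges[i].end - ranges[i].start + 1);
		if (rc) {
			fprintf(stderr, "failed to remove range %u: %d\n",
				i, rc);
			if (!first_err)
				first_err = rc;
		}
	}
	return first_err;
}
```

Keeping the first error rather than the last is a choice made for this sketch;
the thread only settles on continuing past a failure and still reporting it to
the caller.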
^ permalink raw reply
* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: David Hildenbrand (Arm) @ 2026-06-25 13:49 UTC (permalink / raw)
To: Lorenzo Stoakes, Wei Yang
Cc: akpm, riel, liam, vbabka, harry, jannh, willy, linux-mm,
linux-kernel, lance.yang
In-Reply-To: <aj0vjdBN-oNMI2yI@lucifer>
On 6/25/26 15:45, Lorenzo Stoakes wrote:
> On Wed, Jun 24, 2026 at 08:23:59AM +0000, Wei Yang wrote:
>> The kernel test robot reported a build failure on the parisc architecture
>> when expanding HPAGE_PMD_NR in check_pmd().
>
> Let me first say that I absolutely hate that we continue to support museum
> piece architectures to the point that we have to make changes in core code
> to accommodate them.
I wonder why we shouldn't be able to trigger that on other archs with
!CONFIG_TRANSPARENT_HUGEPAGE ?
I think the code just relies on pmd_trans_huge() == false, and consequently
check_pmd will get compiled out completely.
Now, the report was against Wei's new patch.
There is *nothing* to be fixed for existing code.
Fixes: 2aff7a4755be ("mm: Convert page_vma_mapped_walk to work on PFNs")
is just wrong?
--
Cheers,
David
^ permalink raw reply
* Re: [RFC PATCH] mm: Avoiding split large folios if swap has no space
From: David Hildenbrand (Arm) @ 2026-06-25 13:45 UTC (permalink / raw)
To: Johannes Weiner
Cc: Barry Song, akpm, axelrasmussen, baolin.wang, dev.jain, kasong,
lance.yang, liam, linux-kernel, linux-mm, ljs, npache, qi.zheng,
ryan.roberts, shakeel.butt, weixugc, yuanchu, zhaonanzhe, ziy,
Michal Hocko, Roman Gushchin
In-Reply-To: <aj0u12N_GzGtQT6K@cmpxchg.org>
On 6/25/26 15:36, Johannes Weiner wrote:
> On Thu, Jun 25, 2026 at 09:49:56AM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> I don't quite understand you. get_nr_swap_pages() returns
>>> nr_swap_pages, which increases or decreases as swap is allocated or
>>> freed. I guess it just reflects how many swaps we currently have
>>> available?
>>
>> Indeed, I was confused by the function name; it's "free swap pages". So all good :)
>>
>>>
>>>
>>> Yep. The tricky part is that mem_cgroup_try_charge_swap() cannot
>>> return how much swap quota is available in the memcg. Do you prefer to
>>> add an output argument to mem_cgroup_try_charge_swap() to expose
>>> that
>> That would probably be cleanest, if that is easily possible. We would want to
>> get memcg maintainer feedback on that.
>>
>> @memcg folks: we'd like to know whether splitting a large folio would make
>> mem_cgroup_try_charge_swap() succeed on a split (smaller) part, to distinguish
>> "there is no way we can swap out anything, don't split" vs. "we could swap out,
>> split".
>
> It's technically doable, but is this worth the bother? The remaining
> headroom is less than a large folio. You can split this one, but you
> cannot even swap out all of its subpages anymore?
I was asking myself the same, but when we think in terms of THPs on arm64 64k
we're in the range of double-digit MiBs.
> From the cgroup
> side, we don't need the limit to be obeyed this rigidly. We overcharge
> temporarily in other places if it's convenient to do so. A fuzz factor
> around the limit is acceptable.
Thanks for that information.
>
> But if you still want to do it, here is how:
>
> The page_counter_try_charge() in __mem_cgroup_try_charge_swap() walks
> the hierarchy upwards. If it fails, it will store the first level that
> failed against its limit. You can do the mem_cgroup_margin() math
> against this counter to determine headroom. An ancestor *could* be
> more restrictive, so you need to finish the hierarchy walk to the root
> and use the min() of all the swap.max - page_counter_read(swap). Then
> return that in a return argument from __mem_cgroup_try_charge_swap().
Thanks! @Barry, up to you if we want to implement that right away or if we're
simply going to assume that if charging fails, not worth splitting (changing the
existing handling IIUC).
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] mm/page_vma_mapped: guard check_pmd() with CONFIG_TRANSPARENT_HUGEPAGE
From: Lorenzo Stoakes @ 2026-06-25 13:45 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, riel, liam, vbabka, harry, jannh, willy, linux-mm,
linux-kernel, lance.yang
In-Reply-To: <20260624082359.2869-1-richard.weiyang@gmail.com>
On Wed, Jun 24, 2026 at 08:23:59AM +0000, Wei Yang wrote:
> The kernel test robot reported a build failure on the parisc architecture
> when expanding HPAGE_PMD_NR in check_pmd().
Let me first say that I absolutely hate that we continue to support museum
piece architectures to the point that we have to make changes in core code
to accommodate them.
It's not unreasonable to ask retro people to either use older kernels or
make a downstream fork.
People having to think about this upstream is so incredibly silly. As if we
don't have enough work already...
Anyway, with that said...
>
> mm/page_vma_mapped.c:142:13: note: in expansion of macro 'HPAGE_PMD_NR'
> if ((pfn + HPAGE_PMD_NR - 1) < pvmw->pfn)
> ^~~~~~~~~~~~
>
> The config [1] in the report link shows that neither TRANSPARENT_HUGEPAGE nor
> HUGETLB_PAGE is defined, which triggers the BUILD_BUG.
>
> Fix it by defining check_pmd() under CONFIG_TRANSPARENT_HUGEPAGE.
>
> [1]: https://download.01.org/0day-ci/archive/20260624/202606240042.ffPsEXVc-lkp@intel.com/config
I think the fact this wasn't detected for 4 odd years goes to show how well
tested stuff on this arch is... (unless this is a very unusual
configuration at least).
>
> Fixes: 2aff7a4755be ("mm: Convert page_vma_mapped_walk to work on PFNs")
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202606240042.ffPsEXVc-lkp@intel.com/
> ---
> mm/page_vma_mapped.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 17dff8aab9f9..4aac94d9e8a9 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -136,6 +136,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
> return true;
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
As per Andrew, this should be CONFIG_PGTABLE_HAS_HUGE_LEAVES I think.
I don't like that CONFIG_T..HP is taken to mean 'anything to do with leaf
page tables'. That's a mess and one we should unwind.
So don't make it worse, use CONFIG_PGTABLE_HAS_HUGE_LEAVES.
> /* Returns true if the two ranges overlap. Careful to not overflow. */
> static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> {
> @@ -145,6 +146,12 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> return false;
> return true;
> }
> +#else
> +static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> +{
> + return false;
Should have a WARN_ON_ONCE("bug in stupid arch") or similar here ;)
> +}
> +#endif
>
> static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
> {
> --
> 2.34.1
>
Thanks, Lorenzo
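
For readers following along, the overflow-careful overlap test that the quoted
check_pmd() comment refers to can be sketched in userspace C as below. Only the
first comparison appears in the quoted build log; the second comparison, the
SKETCH_PMD_NR value of 512 and the function name are assumptions made for this
illustration, not a copy of mm/page_vma_mapped.c.

```c
#include <stdbool.h>

/* Hypothetical value: number of base pages covered by one PMD mapping. */
#define SKETCH_PMD_NR 512UL

/*
 * Return true if the PMD-sized PFN range starting at pmd_pfn overlaps
 * [pfn, pfn + nr_pages - 1]. Comparing inclusive last PFNs instead of
 * exclusive ends avoids forming a one-past-the-end value that could wrap.
 * The caller must pass nr_pages >= 1.
 */
static bool sketch_check_pmd(unsigned long pmd_pfn, unsigned long pfn,
		unsigned long nr_pages)
{
	/* The PMD range ends before the target range starts. */
	if (pmd_pfn + SKETCH_PMD_NR - 1 < pfn)
		return false;
	/* The PMD range starts after the target range ends. */
	if (pmd_pfn > pfn + nr_pages - 1)
		return false;
	return true;
}
```

In a configuration without huge leaf entries there is no meaningful PMD-sized
range to compare, which is why the thread discusses compiling the function out
or stubbing it.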
^ permalink raw reply
* Re: [RFC PATCH] mm: Avoiding split large folios if swap has no space
From: Johannes Weiner @ 2026-06-25 13:36 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Barry Song, akpm, axelrasmussen, baolin.wang, dev.jain, kasong,
lance.yang, liam, linux-kernel, linux-mm, ljs, npache, qi.zheng,
ryan.roberts, shakeel.butt, weixugc, yuanchu, zhaonanzhe, ziy,
Michal Hocko, Roman Gushchin
In-Reply-To: <c29f90c6-2075-43e8-8f0d-0d6718a0f124@kernel.org>
On Thu, Jun 25, 2026 at 09:49:56AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >> But now I wonder whether we would also want to check "is there any free swap
> >> space", not just "is there any swap".
> >
> > I don't quite understand you. get_nr_swap_pages() returns
> > nr_swap_pages, which increases or decreases as swap is allocated or
> > freed. I guess it just reflects how many swaps we currently have
> > available?
>
> Indeed, I was confused by the function name; it's "free swap pages". So all good :)
>
> >
> >>
> >>
> >> Essentially, try returning -E2BIG if there is the chance to swap out after
> >> split, and -ENOSPC / -ENOMEM if a split wouldn't help.
> >>
> >>> }
> >>>
> >>> again:
> >>> @@ -1769,11 +1772,13 @@ int folio_alloc_swap(struct folio *folio)
> >>> }
> >>>
> >>> /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
> >>> - if (unlikely(mem_cgroup_try_charge_swap(folio)))
> >>> + if (unlikely(mem_cgroup_try_charge_swap(folio))) {
> >>> swap_cache_del_folio(folio);
> >>> + return -ENOMEM;
> >>
> >> Here we wouldn't have the information whether we could charge after a split.
> >>
> >> So that would require a rework to signal this more cleanly to the caller.
> >
> > Yep. The tricky part is that mem_cgroup_try_charge_swap() cannot
> > return how much swap quota is available in the memcg. Do you prefer to
> > add an output argument to mem_cgroup_try_charge_swap() to expose
> > that
> That would probably be cleanest, if that is easily possible. We would want to
> get memcg maintainer feedback on that.
>
> @memcg folks: we'd like to know whether splitting a large folio would make
> mem_cgroup_try_charge_swap() succeed on a split (smaller) part, to distinguish
> "there is no way we can swap out anything, don't split" vs. "we could swap out,
> split".
It's technically doable, but is this worth the bother? The remaining
headroom is less than a large folio. You can split this one, but you
cannot even swap out all of its subpages anymore? From the cgroup
side, we don't need the limit to be obeyed this rigidly. We overcharge
temporarily in other places if it's convenient to do so. A fuzz factor
around the limit is acceptable.
But if you still want to do it, here is how:
The page_counter_try_charge() in __mem_cgroup_try_charge_swap() walks
the hierarchy upwards. If it fails, it will store the first level that
failed against its limit. You can do the mem_cgroup_margin() math
against this counter to determine headroom. An ancestor *could* be
more restrictive, so you need to finish the hierarchy walk to the root
and use the min() of all the swap.max - page_counter_read(swap). Then
return that in a return argument from __mem_cgroup_try_charge_swap().
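
The hierarchy walk described above can be sketched in userspace C roughly as
follows. struct sketch_counter is a simplified, hypothetical stand-in for the
kernel's page counter, and sketch_margin() and sketch_headroom() are
illustrative names rather than existing memcg helpers; the sketch only shows
the "min() over all ancestors of max minus usage" arithmetic.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified counter; not the kernel's struct page_counter. */
struct sketch_counter {
	unsigned long usage;		/* pages currently charged */
	unsigned long max;		/* limit, in pages */
	struct sketch_counter *parent;	/* NULL at the root */
};

/* Pages that can still be charged at this one level. */
static unsigned long sketch_margin(const struct sketch_counter *c)
{
	return c->usage < c->max ? c->max - c->usage : 0;
}

/*
 * Walk from c up to the root and return the smallest margin found, since
 * the most restrictive ancestor bounds what can still be charged.
 */
static unsigned long sketch_headroom(const struct sketch_counter *c)
{
	unsigned long headroom = ULONG_MAX;

	for (; c; c = c->parent) {
		unsigned long margin = sketch_margin(c);

		if (margin < headroom)
			headroom = margin;
	}
	return headroom;
}
```

A caller could compare the returned headroom against the folio size to tell
"a smaller piece might still be charged" apart from "nothing fits", which is
the distinction being asked about earlier in the thread.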
^ permalink raw reply
* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: Gregory Price @ 2026-06-25 13:35 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, nvdimm, linux-kernel, linux-cxl, driver-core,
linux-kselftest, kernel-team, osalvador, gregkh, rafael, dakr,
djbw, vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt,
surenb, mhocko, shuah, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple,
Hannes Reinecke
In-Reply-To: <1d8f74a7-502b-43cb-a0f0-1923049aa213@kernel.org>
On Thu, Jun 25, 2026 at 09:40:02AM +0200, David Hildenbrand (Arm) wrote:
> > Documentation/ABI/testing/sysfs-bus-dax | 26 +++
> > drivers/base/memory.c | 9 +
>
> Can we have this ...
>
> > drivers/dax/kmem.c | 224 ++++++++++++++++++++----
> > include/linux/memory_hotplug.h | 1 +
> >
>
> ... and this as a separate patch, please?
>
> Nothing else jumped at me.
>
ack
~Gregory
^ permalink raw reply
* Re: [PATCH v18 4/8] rust: page: convert to `Ownable`
From: Gary Guo @ 2026-06-25 13:32 UTC (permalink / raw)
To: Andreas Hindborg, Danilo Krummrich, Lorenzo Stoakes,
Vlastimil Babka, Liam R. Howlett, Uladzislau Rezki, Miguel Ojeda,
Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
Alice Ryhl, Trevor Gross, Daniel Almeida, Tamir Duberstein,
Alexandre Courbot, Onur Özkan, Lyude Paul,
Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
Christian Brauner, Carlos Llamas, Rafael J. Wysocki, Dave Ertman,
Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
David Airlie, Simona Vetter, Alexander Viro, Jan Kara,
Igor Korotin, Viresh Kumar, Nishanth Menon, Stephen Boyd,
Bjorn Helgaas, Krzysztof Wilczyński, Pavel Tikhomirov,
Michal Wilczynski
Cc: Philipp Stanner, rust-for-linux, linux-kernel, linux-mm,
driver-core, linux-block, linux-security-module, dri-devel,
linux-fsdevel, linux-pm, linux-pci, linux-pwm, Asahi Lina
In-Reply-To: <20260625-unique-ref-v18-4-4e06b5896d47@kernel.org>
On Thu Jun 25, 2026 at 11:15 AM BST, Andreas Hindborg wrote:
> From: Asahi Lina <lina@asahilina.net>
>
> This allows Page references to be returned as borrowed references,
> without necessarily owning the struct page.
>
> Remove `BorrowedPage` and update users to use `Owned<Page>`.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> [ Andreas: Fix formatting and add a safety comment, update users. ]
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Nice to see `BorrowedPage` going away.
Reviewed-by: Gary Guo <gary@garyguo.net>
> ---
> drivers/android/binder/page_range.rs | 10 +--
> rust/kernel/alloc/allocator.rs | 19 +++---
> rust/kernel/alloc/allocator/iter.rs | 6 +-
> rust/kernel/page.rs | 122 +++++++++--------------------------
> 4 files changed, 46 insertions(+), 111 deletions(-)
^ permalink raw reply
* Re: [PATCH v18 3/8] rust: implement `ForeignOwnable` for `Owned`
From: Gary Guo @ 2026-06-25 13:29 UTC (permalink / raw)
To: Andreas Hindborg, Danilo Krummrich, Lorenzo Stoakes,
Vlastimil Babka, Liam R. Howlett, Uladzislau Rezki, Miguel Ojeda,
Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
Alice Ryhl, Trevor Gross, Daniel Almeida, Tamir Duberstein,
Alexandre Courbot, Onur Özkan, Lyude Paul,
Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
Christian Brauner, Carlos Llamas, Rafael J. Wysocki, Dave Ertman,
Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
David Airlie, Simona Vetter, Alexander Viro, Jan Kara,
Igor Korotin, Viresh Kumar, Nishanth Menon, Stephen Boyd,
Bjorn Helgaas, Krzysztof Wilczyński, Pavel Tikhomirov,
Michal Wilczynski
Cc: Philipp Stanner, rust-for-linux, linux-kernel, linux-mm,
driver-core, linux-block, linux-security-module, dri-devel,
linux-fsdevel, linux-pm, linux-pci, linux-pwm
In-Reply-To: <20260625-unique-ref-v18-3-4e06b5896d47@kernel.org>
On Thu Jun 25, 2026 at 11:15 AM BST, Andreas Hindborg wrote:
> Implement `ForeignOwnable` for `Owned<T>`. This allows use of `Owned<T>` in
> places such as the `XArray`.
>
> Note that `T` does not need to implement `ForeignOwnable` for `Owned<T>` to
> implement `ForeignOwnable`.
>
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
> ---
> rust/kernel/owned.rs | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 53 insertions(+)
>
> diff --git a/rust/kernel/owned.rs b/rust/kernel/owned.rs
> index 7fe9ec3e55126..9c92d4a83cc1b 100644
> --- a/rust/kernel/owned.rs
> +++ b/rust/kernel/owned.rs
> @@ -15,6 +15,8 @@
> ptr::NonNull, //
> };
>
> +use kernel::types::ForeignOwnable;
> +
> /// Types that specify their own way of performing allocation and destruction. Typically, this trait
> /// is implemented on types from the C side.
> ///
> @@ -186,3 +188,54 @@ fn drop(&mut self) {
> unsafe { T::release(self.ptr) };
> }
> }
> +
> +// SAFETY: We derive the pointer to `T` from a valid `T`, so the returned
> +// pointer satisfies the alignment requirements of `T`.
> +unsafe impl<T: Ownable> ForeignOwnable for Owned<T> {
> + const FOREIGN_ALIGN: usize = core::mem::align_of::<T>();
> +
> + type Borrowed<'a>
> + = &'a T
> + where
> + Self: 'a;
> + type BorrowedMut<'a>
> + = Pin<&'a mut T>
> + where
> + Self: 'a;
> +
> + #[inline]
> + fn into_foreign(self) -> *mut kernel::ffi::c_void {
> + let ptr = self.ptr.as_ptr().cast();
> + core::mem::forget(self);
> + ptr
I think the pattern in `into_raw` is better:
    ManuallyDrop::new(self).ptr.as_ptr().cast()
Or perhaps this can just use `Self::into_raw(self).as_ptr().cast()`.
> + }
> +
> + #[inline]
> + unsafe fn from_foreign(ptr: *mut kernel::ffi::c_void) -> Self {
> + // INVARIANT: By the function safety contract, `ptr` was returned by `into_foreign`, which
> + // gave up exclusive ownership of a valid, pinned `T`; we retake that ownership here.
> + Self {
> + // SAFETY: By function safety contract, `ptr` came from
> + // `into_foreign` and cannot be null.
> + ptr: unsafe { NonNull::new_unchecked(ptr.cast()) },
> + }
> + }
Same here, could be using `Self::from_raw`.
However, the current code looks correct to me regardless, so:
Reviewed-by: Gary Guo <gary@garyguo.net>
Best,
Gary
> +
> + #[inline]
> + unsafe fn borrow<'a>(ptr: *mut kernel::ffi::c_void) -> Self::Borrowed<'a> {
> + // SAFETY: By function safety requirements, `ptr` is valid for use as a
> + // reference for `'a`.
> + unsafe { &*ptr.cast() }
> + }
> +
> + #[inline]
> + unsafe fn borrow_mut<'a>(ptr: *mut kernel::ffi::c_void) -> Self::BorrowedMut<'a> {
> + // SAFETY: By function safety requirements, `ptr` is valid for use as a
> + // unique reference for `'a`.
> + let inner = unsafe { &mut *ptr.cast() };
> +
> + // SAFETY: We never move out of inner, and we do not hand out mutable
> + // references when `T: !Unpin`.
> + unsafe { Pin::new_unchecked(inner) }
> + }
> +}
^ permalink raw reply
* Re: [Patch mm-hotfixes v4] mm/page_vma_mapped: fix device-private PMD handling
From: Lorenzo Stoakes @ 2026-06-25 13:12 UTC (permalink / raw)
To: Lance Yang
Cc: richard.weiyang, akpm, david, riel, liam, vbabka, harry, jannh,
ziy, sj, balbirs, linux-mm, linux-kernel, stable
In-Reply-To: <20260624085756.6598-1-lance.yang@linux.dev>
On Wed, Jun 24, 2026 at 04:57:56PM +0800, Lance Yang wrote:
>
> On Wed, Jun 24, 2026 at 06:53:53AM +0000, Wei Yang wrote:
> >Commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support
> >device-private entries") introduced the concept of device-private
> >PMD entries, but did not correctly update the rmap walk code to
> >account for them.
> >
> >As a result, when page_vma_mapped_walk() encounters device-private
> >PMD entries, it takes no action other than to acquire the PMD lock
> >and exit.
> >
> >However this is highly problematic for two reasons - firstly,
> >device private entries possess a PFN so check_pmd() needs to be
> >called to ensure an overlapping PFN range.
> >
> >Secondly, and more importantly, if PVMW_MIGRATION is set the
> >caller assumes the returned entry is a migration entry, resulting
> >in memory corruption when the caller tries to interpret the device
> >private entry as such.
> >
> >In addition, commit 146287290023 ("mm/huge_memory: implement
> >device-private THP splitting") allowed device private PMDs to be
> >split like THP mappings, but again did not update this code path.
> >
> >As a result, we might race a PMD split prior to acquiring the PMD
> >lock.
> >
> >This patch addresses all of these issues by invoking check_pmd(),
> >ensuring PVMW_MIGRATION is not set, and checking whether a split raced
> >us, as we do for PMD THP and migration entries.
> >
> >Fixes: 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries")
> >Cc: <stable@vger.kernel.org>
> >Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> >Suggested-by: David Hildenbrand <david@kernel.org>
>
> Shouldn't we add
>
> Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
>
> as well?
>
> v4 mostly follows Lorenzo's comments, code bits included. Feels only fair.
Thanks Lance :)
I'm kinda indifferent about it, really. I'm keen to ensure people sending
patches get the credit for their work, so if I send a patch in reply as
shorthand for 'I think this might work better', I don't expect/require any
credit at all. It's just sometimes a quicker way of responding!
But if Wei wants to add a S-b that's fine by me also! :)
Cheers, Lorenzo
^ permalink raw reply
* Re: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
From: Christian König @ 2026-06-25 13:06 UTC (permalink / raw)
To: 蒋 亦韬, Alex Deucher, David Airlie,
Simona Vetter, Felix Kuehling, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Yang, Philip
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <SY5PR01MB10599BCF8625DB8EAE9827293C0EC2@SY5PR01MB10599.ausprd01.prod.outlook.com>
Hi Yitao,
adding Philip Yang.
Thanks for the investigation. That sounds like some kind of bug in the KFD SVM handling; the driver should be perfectly capable of handling this.
I strongly suggest opening a bug report for ROCm that describes how to reproduce this. Philip can probably point you to the right location for that.
Regards,
Christian.
On 6/25/26 15:01, 蒋 亦韬 wrote:
> Hi Christian,
>
> I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
>
> The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
>
> The relevant kernel messages I repeatedly saw were along these lines:
>
> MES failed to respond to msg=REMOVE_QUEUE
> MES failed to respond to msg=SUSPEND
> failed to suspend all gangs
> failed to remove hardware queue from MES
> Failed to evict queue
> Failed to evict process queues
> GPU reset begin
>
> While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
>
> svm_range_cpu_invalidate_pagetables()
> -> svm_range_evict()
> -> kgd2kfd_quiesce_mm()
> -> KFD process queue eviction
> -> MES REMOVE_QUEUE / SUSPEND
>
> One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
>
> Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
>
> So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
>
> Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
>
> Regards,
> Yitao
> *From:* Christian König <christian.koenig@amd.com>
> *Sent:* June 25, 2026 8:35
> *To:* Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; David Hildenbrand <david@kernel.org>; Lorenzo Stoakes <ljs@kernel.org>
> *Cc:* Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
> *Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
>
> On 6/25/26 12:59, Yitao Jiang wrote:
>> Hi,
>>
>> This series fixes a THP policy problem I found while debugging
>> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
>> training.
>>
>> Some AMDGPU/KFD user mappings are registered through interval
>> notifiers and cannot safely tolerate the backing VMA changing from base
>> pages to a transparent huge page after registration.
>
> That's certainly not correct. This is a must have for a whole lot of use cases.
>
> Why exactly isn't that working for your use case?
>
> Regards,
> Christian.
>
>> Userspace can
>> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
>> collapse the range, after the GPU mapping has been registered.
>>
>> On my system this showed up as asynchronous ROCm/HIP kernel launch
>> failures, often reported later at a synchronization or copy point. I
>> expect the issue to be relevant to AMDGPU/KFD mappings on
>> XNACK-disabled GPUs more generally, because those mappings cannot rely
>> on replayable GPU faults after a CPU-side THP remap. I have validated
>> the failure and fix on AMD Radeon 780M / gfx1103.
>>
>> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
>> users can ask the MM core to keep the covered VMA range out of THP
>> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
>> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
>> over an active opt-in range is treated as an ignored hint, and
>> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>>
>> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
>> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
>> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
>> current behavior.
>>
>> This does not disable THP globally and does not add work to GPU
>> command submission or kernel launch paths. Additional work is limited
>> to opt-in notifier registration, opt-in notifier flag transitions, and
>> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>>
>> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>>
>> - scripts/checkpatch.pl --strict --no-tree
>> - git apply --check
>> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>> DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>> originally exposed the failure on my Radeon 780M system
>>
>> The standalone reproducers depend on ROCm userspace libraries, so I
>> have not included them in this series. I can send them separately if
>> useful.
>>
>> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
>> I reviewed the resulting code and take responsibility for the
>> submission.
>>
>> Yitao Jiang (3):
>> mm/mmu_notifier: let interval notifiers block THP
>> drm/amdgpu: block THP for HSA userptr notifiers
>> drm/amdkfd: block THP for non-replayable SVM ranges
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
>> include/linux/huge_mm.h | 5 +-
>> include/linux/mmu_notifier.h | 28 ++++
>> mm/khugepaged.c | 9 +-
>> mm/madvise.c | 3 +-
>> mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
>> 7 files changed, 286 insertions(+), 24 deletions(-)
>>
>>
>> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
>> --
>> 2.53.0
>
^ permalink raw reply
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
From: 蒋 亦韬 @ 2026-06-25 13:01 UTC (permalink / raw)
To: Christian König, Alex Deucher, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <8f68ce2f-c33d-49f5-a671-7a4ab2f4f3d3@amd.com>
Hi Christian,
I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
The relevant kernel messages I repeatedly saw were along these lines:
    MES failed to respond to msg=REMOVE_QUEUE
    MES failed to respond to msg=SUSPEND
    failed to suspend all gangs
    failed to remove hardware queue from MES
    Failed to evict queue
    Failed to evict process queues
    GPU reset begin
While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
    svm_range_cpu_invalidate_pagetables()
      -> svm_range_evict()
      -> kgd2kfd_quiesce_mm()
      -> KFD process queue eviction
      -> MES REMOVE_QUEUE / SUSPEND
One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
Regards,
Yitao
________________________________
From: Christian König <christian.koenig@amd.com>
Sent: June 25, 2026 8:35
To: Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; David Hildenbrand <david@kernel.org>; Lorenzo Stoakes <ljs@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.
That's certainly not correct. This is a must have for a whole lot of use cases.
Why exactly isn't that working for your use case?
Regards,
Christian.
> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
> - scripts/checkpatch.pl --strict --no-tree
> - git apply --check
> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
> DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
> originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
> mm/mmu_notifier: let interval notifiers block THP
> drm/amdgpu: block THP for HSA userptr notifiers
> drm/amdkfd: block THP for non-replayable SVM ranges
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
> include/linux/huge_mm.h | 5 +-
> include/linux/mmu_notifier.h | 28 ++++
> mm/khugepaged.c | 9 +-
> mm/madvise.c | 3 +-
> mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
> 7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
^ permalink raw reply
* Re: [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: Johannes Weiner @ 2026-06-25 13:00 UTC (permalink / raw)
To: Joshua Hahn
Cc: linux-mm, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, cgroups, linux-kernel, kernel-team
In-Reply-To: <20260624183700.1152742-1-joshua.hahnjy@gmail.com>
On Wed, Jun 24, 2026 at 11:36:59AM -0700, Joshua Hahn wrote:
> Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
> last caller of for_each_mem_cgroup back in 2021, and there have not been
> any new callers since. Remove the macro.
>
> A comment in mem_cgroup_css_online has also been out of date since 2021,
> when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
> code") open-coded the for_each_mem_cgroup iterator. Update the comment.
>
> Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
> statistics and events") added a second declaration for memcg_events to
> include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
> Let's clean that up too.
>
> No functional changes intended.
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: David Hildenbrand (Arm) @ 2026-06-25 12:57 UTC (permalink / raw)
To: Sean Christopherson, Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajx3vmNPRf-M9kR6@google.com>
On 6/25/26 02:35, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>>>
>>> Under what circumstances does this happen,
>>
>> It happened 100% of the time in selftests. Perhaps it's because in the
>> selftests the pages are almost always freshly allocated and so the
>> lru_add fbatch isn't full yet? (and that the host isn't super busy so
>> lru_add fbatch doesn't get drained yet).
>
> I chatted with Ackerley about this. What I wanted to understand is why guest_memfd
> pages were getting put onto per-CPU batches for lru_add(), given that guest_memfd
> pages are unevictable. The answer (assuming I read the code right), is that
> lruvec_add_folio() updates stats and other per-lru metadata for the unevictable
> lru, and does so under a per-lru lock. I.e. we don't want to skip that stuff
> entirely.
Hm. Our pages don't participate in any LRU activity (including
isolation+migration). Isolation+migration would only apply once we'd support
page migration.
But yes, secretmem also does it like that: filemap_add_folio() will call
folio_add_lru().
Traditionally we used the unevictable LRU only for mlock purposes.
But yeah, there are "unevictable" stats involved ....
>
> One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> something into folio_may_be_lru_cached(). But due to taking a per-lru lock,
> that would penalize the relatively hot path and definitely common operation of
> faulting in guest memory. On the other hand, memory conversion is already a
> relatively slow operation and is relatively uncommon compared to page faults,
> (and likely very uncommon for real world setups). I.e. having to drain all
> caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> path.
Yeah, the lru_add_drain_all() is rather messy.
We have similar code in collect_longterm_unpinnable_folios(), where we first
try a lru_add_drain() and then escalate to a lru_add_drain_all().
Maybe we could factor that (suboptimal code) out to not have to reinvent the
same thing multiple times?
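
A rough userspace sketch of what such a factored-out helper might look like,
under stated assumptions: struct drain_ops, check_with_drain_escalation() and
the mock_* fixtures are all hypothetical names, and the callbacks only stand
in for lru_add_drain() and lru_add_drain_all(). It shows the escalation order
only (check, cheap local drain, check, expensive global drain, check), not the
actual mm code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks standing in for lru_add_drain() / lru_add_drain_all(). */
struct drain_ops {
	void (*drain_local)(void *arg);	/* cheap: this CPU only */
	void (*drain_all)(void *arg);	/* expensive: IPIs to all CPUs */
};

/*
 * Run @is_safe, escalating from no drain, to a local drain, to a global
 * drain. Returns true as soon as the check passes, so the expensive step
 * is only paid when the cheaper ones were not enough.
 */
static bool check_with_drain_escalation(bool (*is_safe)(void *arg), void *arg,
					const struct drain_ops *ops)
{
	assert(is_safe != NULL && ops != NULL);

	if (is_safe(arg))
		return true;
	ops->drain_local(arg);
	if (is_safe(arg))
		return true;
	ops->drain_all(arg);
	return is_safe(arg);
}

/* Mock state for illustration: extra references held by per-CPU batches. */
struct mock_state {
	int local_refs;		/* held by this CPU's batch */
	int remote_refs;	/* held by other CPUs' batches */
	int nr_drain_all;	/* how often the expensive path ran */
};

static bool mock_is_safe(void *arg)
{
	struct mock_state *s = arg;

	return s->local_refs == 0 && s->remote_refs == 0;
}

static void mock_drain_local(void *arg)
{
	struct mock_state *s = arg;

	s->local_refs = 0;
}

static void mock_drain_all(void *arg)
{
	struct mock_state *s = arg;

	s->local_refs = 0;
	s->remote_refs = 0;
	s->nr_drain_all++;
}

static const struct drain_ops mock_ops = {
	.drain_local = mock_drain_local,
	.drain_all = mock_drain_all,
};
```

With the mock, a state that only has local references passes after the local
drain and never reaches the global one.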
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v11 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering
From: zhen.ni @ 2026-06-25 12:57 UTC (permalink / raw)
To: Andrew Morton
Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
linux-kernel, Yichong Chen, Ye Liu
In-Reply-To: <20260624215526.ed20169b440c62d71a3f9d90@linux-foundation.org>
On 2026/6/25 12:55, Andrew Morton wrote:
> On Thu, 25 Jun 2026 12:30:57 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
>
>> This patch series introduces per-file-descriptor filtering capabilities to the
>> page_owner feature.
>
> Well, I assume this work was inspired by your own operational
> experience with page_owner. There's no better inspiration than this!
>
> Review is thin (absent) at v11. This is typical with page_owner
> changes :(. I'll add the series for testing while interested people
> check over it (please).
>
> AI review might have found a few things which you might choose to
> address. Please check it out:
>
> https://sashiko.dev/#/patchset/20260625043101.338794-1-zhen.ni@easystack.cn
>
>
>
>
Hi,
Thanks for the review. Let me address the questions:
Q1: Can empty write silently revert concurrent filter changes?
Q3: Can concurrent writes clobber independent filter settings?
A1&3: Yes, this is theoretically possible when multiple threads share
the same fd. The current implementation uses short-duration spinlocks as
a practical trade-off rather than holding locks during the entire
parsing process.
However, I believe the current design meets functional requirements:
1. Most users will use the page_owner_filter tool rather than
programming against page_owner directly. For concurrent filtering needs,
multiple processes can use independent file descriptors.
2. Even in the multi-threaded shared-fd case, the worst outcome is that
filter settings get overwritten. Since page_owner is a debug feature,
the impact is limited.
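For reference, one way to keep the lock hold time short without letting independent settings clobber each other is to parse into a local update first and then merge only the fields that the write actually set. The sketch below is a userspace illustration of that idea, not the code in this series: struct fd_filter, struct filter_update, the FILTER_HAS_* bits and the use of -1 as "no node filter" are all hypothetical, and a C11 atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FILTER_HAS_MODE	(1u << 0)	/* the write set print_mode */
#define FILTER_HAS_NODE	(1u << 1)	/* the write set numa_node */

/* Hypothetical per-fd filter state. */
struct fd_filter {
	atomic_flag lock;	/* stand-in for a spinlock */
	int print_mode;
	int numa_node;		/* -1: no NUMA node filter (assumed) */
};

/* Result of parsing one write, built without holding the lock. */
struct filter_update {
	unsigned int fields;	/* FILTER_HAS_* bits that were parsed */
	int print_mode;
	int numa_node;
};

static void filter_init(struct fd_filter *f)
{
	atomic_flag_clear(&f->lock);
	f->print_mode = 0;
	f->numa_node = -1;
}

/*
 * Merge a parsed update under the lock. Fields that the write did not
 * mention are left alone, so two writers that touch different fields
 * cannot overwrite each other's settings.
 */
static void filter_apply(struct fd_filter *f, const struct filter_update *u)
{
	while (atomic_flag_test_and_set_explicit(&f->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	if (u->fields & FILTER_HAS_MODE)
		f->print_mode = u->print_mode;
	if (u->fields & FILTER_HAS_NODE)
		f->numa_node = u->numa_node;

	atomic_flag_clear_explicit(&f->lock, memory_order_release);
}
```

Under these assumptions, a write that only carries a node value of -1 would also clear the node filter without touching print_mode.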
---
Q2: How can users disable/clear the NUMA node filter?
A2: Clearing the NUMA filter while holding the fd open is an edge case
without strong practical necessity. If users need to change filter
behavior, they can simply:
- Apply a different filter in the next write operation, or
- Close and reopen the file descriptor
The filter is designed for targeted debugging sessions where the
configuration is set up front and used for the session.
If you believe that holding locks for the entire write process is
necessary, please let me know.
Thanks,
Zhen
^ permalink raw reply
* Re: [PATCH 2/3] ovl: support cachestat() syscall on overlayfs files
From: Matthew Wilcox @ 2026-06-25 12:53 UTC (permalink / raw)
To: Nhat Pham
Cc: Amir Goldstein, Pavel Tikhomirov, Miklos Szeredi, Alexander Viro,
Christian Brauner, Jan Kara, Andrew Morton, Johannes Weiner,
Shuah Khan, linux-unionfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
In-Reply-To: <CAKEwX=MedyK8aDM=oe=Hbzo77mJOEp0Pmo0u8N4HhvikosVbSw@mail.gmail.com>
On Wed, Jun 24, 2026 at 12:06:54PM -0700, Nhat Pham wrote:
> I'm more concerned with undocumented/unexpected behavior (error type
> in this case). -EIO was an example that I saw in ovl_real_file()
> itself, but I'm not familiar enough with overlayfs to know if that's
> the extent of it.
>
> But I'm OK with just updating the documentation with a simple note
> that other errors may be propagated from the underlying fs, if no one
> else thinks it's a problem :)
That's ALWAYS true. POSIX even says so explicitly in section 2.3:
Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other
than those described here, or may contain extensions or limitations
that prevent some errors from occurring.
We don't generally bother to document that pretty much every syscall may
return -ENOMEM if it can't allocate memory. That's just ... expected.
open(2) documents the possibility, but read(2) doesn't. I think it's
the same for EIO. Any operation which accesses storage can return -EIO.
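In practice that means callers should not treat the documented error list as exhaustive. A minimal userspace sketch, using read(2) as the example (read_retry() is a hypothetical helper name, not an existing API): retry the one error that is worth retrying, and pass everything else up unchanged.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to @len bytes from @fd, retrying if interrupted by a signal.
 * Any other error, whether or not the man page lists it, is returned
 * to the caller as a negative errno value.
 */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
	ssize_t ret;

	do {
		ret = read(fd, buf, len);
	} while (ret < 0 && errno == EINTR);

	return ret < 0 ? -errno : ret;
}
```

A caller written this way handles -EIO or -ENOMEM from an underlying filesystem the same way it handles any other failure, which is why a per-syscall note adds little.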
^ permalink raw reply
* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Alexandre Ghiti @ 2026-06-25 12:52 UTC (permalink / raw)
To: Barry Song
Cc: akpm, hannes, yosry, nphamcs, chengming.zhou, david, ljs, liam,
vbabka, rppt, surenb, mhocko, kasong, chrisl, usama.arif,
linux-mm, linux-kernel
In-Reply-To: <CAGsJ_4xM=WnFO-OuwVude-DQ_UkaCi_zF_bXBP9Li5+ZPNy0VQ@mail.gmail.com>
Hi Barry,
On 6/24/26 21:24, Barry Song wrote:
> On Wed, Jun 24, 2026 at 3:57 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
>> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>>
>> zswap is the same kind of in-memory, synchronous backend as zram, but it
>> is not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
>> swapin_readahead().
>>
>> Here are the results from bypassing readahead for zswap too, measured
>> with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker off, on
>> Sapphire Rapids, over 3 iterations.
>>
>> 768M memcg (sustained swap thrash):
>>
>>   metric                    mm-new    + bypass    delta
>>   build time (s)             405.0       341.7    -15.6%
>>   zswap-in (GB)               79.5        53.0    -33%
>>   zswap-out (GB)             144.8       115.6    -20%
>>   swap readahead (pages)     6.79M       0.45M    -93%
>>   swap_ra hit (%)             72.1        89.9    +18pp
>>
>> 1G memcg (light pressure, build not memory-bound):
>>
>>   metric                    mm-new    + bypass    delta
>>   build time (s)             177.7       176.0    ~same (no regression)
>>   zswap-in (GB)               10.2         7.5    -26%
>>   zswap-out (GB)              27.7        25.1    -9%
>>   swap readahead (pages)     1.07M       0.08M    -93%
>>   swap_ra hit (%)             68.6        87.2    +19pp
>>
>> The gain is from no longer prefetching pages that are pointless for an
>> in-memory backend: readahead inflates anon residency and thrashes the
>> page cache (file pages get evicted and re-read), lengthens each fault by
>> synchronously (de)compressing a cluster of neighbours, and adds
>> compression traffic when those extra pages are reclaimed.
>>
>> Bypassing swap readahead for zswap therefore makes sense.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>
>> - This bypass originally comes from Usama's series that implements
>> large folio zswapin: while working on improving this series, I noticed
>> the gains I got only came from the bypass of readahead.
>>
>> include/linux/zswap.h | 6 ++++++
>> mm/memory.c | 5 +++--
>> mm/zswap.c | 11 +++++++++++
>> 3 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
>> index 30c193a1207e..b6f0e6198b6f 100644
>> --- a/include/linux/zswap.h
>> +++ b/include/linux/zswap.h
>> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>> void zswap_folio_swapin(struct folio *folio);
>> bool zswap_is_enabled(void);
>> bool zswap_never_enabled(void);
>> +bool zswap_present_test(swp_entry_t swp);
>> #else
>>
>> struct zswap_lruvec_state {};
>> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>> return true;
>> }
>>
>> +static inline bool zswap_present_test(swp_entry_t swp)
>> +{
>> + return false;
>> +}
>> +
>> #endif
>>
>> #endif /* _LINUX_ZSWAP_H */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ff338c2abe92..5aa1ea9eb48a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> if (folio)
>> swap_update_readahead(folio, vma, vmf->address);
>> if (!folio) {
>> - /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
>> - if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
>> + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
>> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
>> + zswap_present_test(entry))
>> folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>> thp_swapin_suitable_orders(vmf) | BIT(0),
>> vmf, NULL, 0);
> Basically, I have been seeing the same issue recently. If the
> readahead swap entries are also in zswap, we end up doing the
> decompression during one page fault, but then need another page fault
> to fetch the page from the swap cache and install the mapping. In that
> case, readahead may not be beneficial.
Oh, I had not noticed that. Indeed, since zswap readahead is synchronous,
we can clearly avoid the second page fault!
>
> On the other hand, if the readahead swap entries are not in zswap, the
> situation is different.
>
> For example, suppose we fault on the swap entry for address 1 MB and
> readahead brings in the entry for 1 MB + 4 KB. If both entries are in
> zswap, readahead does not seem like a good trade-off. However, if the
> 1 MB + 4 KB entry is not in zswap and would otherwise require storage
> I/O, then readahead can be beneficial.
Yosry made the same comment, I'll explore this.
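As a starting point for that exploration, here is a rough userspace sketch of one possible policy: keep readahead only when at least one neighbour in the window would need real storage I/O. This is neither what my patch nor what your PoC does. The in_zswap[] array is a hypothetical per-entry lookup result, should_readahead() is a made-up name, and the aligned window is an assumption that mirrors how cluster readahead picks its range.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether readahead around @fault_idx is worth doing.
 *
 * @in_zswap:   in_zswap[i] is true if swap entry i is held in zswap
 * @nr_entries: number of entries in @in_zswap
 * @fault_idx:  index of the faulting entry
 * @window:     readahead window size in entries (assumed power of two)
 *
 * Returns true if some neighbour in the aligned window is not in zswap,
 * i.e. prefetching it could hide storage I/O latency.
 */
static bool should_readahead(const bool *in_zswap, size_t nr_entries,
			     size_t fault_idx, size_t window)
{
	size_t start, end, i;

	if (window == 0 || fault_idx >= nr_entries)
		return false;

	/* Align the window the way cluster readahead does. */
	start = fault_idx - (fault_idx % window);
	end = start + window;
	if (end > nr_entries)
		end = nr_entries;

	for (i = start; i < end; i++) {
		if (i != fault_idx && !in_zswap[i])
			return true;	/* this neighbour needs storage I/O */
	}

	/* Every neighbour is in zswap: prefetching only costs CPU. */
	return false;
}
```

Whether a per-entry lookup like this is cheap enough on the fault path is an open question.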
>
> So I implemented a rather ugly fault_around-like mechanism in
> do_swap_page(). At least with page-cluster == 1, I am seeing a
> performance improvement, as the readahead folios can be mapped
> directly and do not require a second page fault.
IIUC the code below, you wait until the end of the page fault to try to
map a folio that would have been read ahead, right? I guess you do that at
the end in the hope that the I/O has finished by then?
Maybe we can do that synchronously for zswap, since the readahead is
synchronous? And for the readahead pages that require I/O, wouldn't it
be possible to do it in the end-of-I/O callback instead?
Thanks for your comments,
Alex
>
> It is admittedly quite ugly and is only meant as a proof of concept :-)
>
> Subject: [PATCH PoC] mm: enable do_swap_page fault_around
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> mm/memory.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 95 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c00a31a6d1d0..1db79f45a575 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4736,6 +4736,100 @@ static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> } while (--nr_pages);
> }
>
> +static void do_swap_map_around(struct vm_fault *vmf, struct swap_info_struct *si)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + int nr_around = 1 << page_cluster;
> + unsigned long start = max3(vma->vm_start, vmf->address - (nr_around - 1) * PAGE_SIZE,
> + vmf->address & PMD_MASK);
> + unsigned long end = min3(vma->vm_end, vmf->address + nr_around * PAGE_SIZE,
> + (vmf->address & PMD_MASK) + PMD_SIZE);
> + unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> + unsigned long delta_pages = (vmf->address - start) >> PAGE_SHIFT;
> + pte_t *ptep = vmf->pte - delta_pages;
> +
> + for (int i = 0; i < nr_pages; i++, ptep++) {
> + unsigned long address = start + (i << PAGE_SHIFT);
> + rmap_t rmap_flags = RMAP_NONE;
> + pte_t orig_pte, pte;
> + struct folio *folio;
> + struct page *page;
> + softleaf_t entry;
> + bool exclusive;
> +
> + if (ptep == vmf->pte)
> + continue;
> + orig_pte = ptep_get(ptep);
> + exclusive = pte_swp_exclusive(orig_pte);
> + if (!exclusive)
> + continue;
> + entry = softleaf_from_pte(orig_pte);
> + if (!softleaf_is_swap(entry))
> + continue;
> + folio = swap_cache_get_folio(entry);
> + if (!folio)
> + continue;
> + if (unlikely(!folio_matches_swap_entry(folio, entry)))
> + goto skip;
> + if (folio_test_locked(folio))
> + goto skip;
> + if (!folio_test_uptodate(folio))
> + goto skip;
> + if (!folio_trylock(folio))
> + goto skip;
> + if (folio_test_ksm(folio) || folio_test_large(folio) ||
> + !folio_test_uptodate(folio))
> + goto unlock;
> + if (exclusive && folio_test_writeback(folio) &&
> + data_race(si->flags & SWP_STABLE_WRITES))
> + exclusive = false;
> +
> + arch_swap_restore(folio_swap(entry, folio), folio);
> +
> + page = folio_page(folio, 0);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
> + pte = mk_pte(page, vma->vm_page_prot);
> + if (pte_swp_soft_dirty(orig_pte))
> + pte = pte_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(orig_pte))
> + pte = pte_mkuffd_wp(pte);
> +
> + if (exclusive) {
> + if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> + !pte_needs_soft_dirty_wp(vma, pte)) {
> + pte = pte_mkwrite(pte, vma);
> + }
> + rmap_flags |= RMAP_EXCLUSIVE;
> + }
> + flush_icache_pages(vma, page, 1);
> +
> + if (!folio_test_anon(folio)) {
> + folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> + folio_put_swap(folio, NULL);
> + } else {
> + folio_add_anon_rmap_ptes(folio, page, 1, vma, address,
> + rmap_flags);
> + folio_put_swap(folio, page);
> + }
> +
> + set_ptes(vma->vm_mm, address, ptep, pte, 1);
> + arch_do_swap_page_nr(vma->vm_mm, vma, address,
> + pte, pte, 1);
> +
> + if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
> + folio_free_swap(folio);
> + folio_unlock(folio);
> + swap_update_readahead(folio, vma, address);
> + update_mmu_cache_range(vmf, vma, address, ptep, 1);
> + continue;
> +unlock:
> + folio_unlock(folio);
> +skip:
> + folio_put(folio);
> + };
> +}
> +
> /*
> * We enter with either the VMA lock or the mmap_lock held (see
> * FAULT_FLAG_VMA_LOCK), and pte mapped but not yet locked.
> @@ -5121,6 +5215,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
> + do_swap_map_around(vmf, si);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
^ permalink raw reply