Linux-mm Archive on lore.kernel.org


* [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches

This patchset is for VFS. Of course, it depends on my previous vmsplice
patchset ( https://lore.kernel.org/all/20260531010107.1953702-1-safinaskar@gmail.com/ ).

I fix some problems in my previous patchset.

1. Fix a problem with CLASS(fd, f)(fd). See the first patch in this patchset
for details. This is probably not that important, but I fix it anyway.

2. Change "unsigned long" back to "int". See the second patch for details.
Again, this is probably not important, but I want to fix it anyway.

3. Fix the LTP vmsplice01 bug.

4. libfuse relies on the sharing behavior of vmsplice. So we detect a
particular combination of flags to pipe2(2) and vmsplice(2) and return
-EINVAL. This forces libfuse to fall back to its non-vmsplice code path,
i.e. it fixes the libfuse-related regression [1].
I ran a Debian Code Search for the regex "vmsplice.*SPLICE_F_NONBLOCK" and
found no other packages using this particular combination of flags
except for fuse itself. (fio and stress-ng also match,
but these are merely testers.) So I think it is okay to return
EINVAL here; breakage will be minimal.

5. Set FMODE_NOWAIT for named FIFOs. CRIU relies on the ability to do
vmsplice(SPLICE_F_NONBLOCK) on named FIFOs, so this fixes the CRIU-related
regression [2]. There is another CRIU-related regression that I do not
fix [3]: CRIU in splice mode becomes so slow that splice mode
becomes useless. I personally still believe that removing vmsplice is
the right thing to do. Another option is to do nothing. Yet another option
is to implement some deprecation period [3]. Let other developers
decide.

See patches for details.

Please run the LTP vmsplice01 test again.

Notes:

- I want to repeat: I change the behavior around SPLICE_F_NONBLOCK.
Previously, vmsplice ignored whether the pipe itself was opened as a
non-blocking file. Now this is taken into account, and in my opinion
the new behavior is better.
- vmsplice(2) now lives in fs/read_write.c. It is now very similar to
preadv2 and pwritev2, so I think it belongs there.

Please review this patchset carefully. I'm still a new contributor.
In particular, please review the do-while loop; I'm not sure I did
everything right.

Tested in QEMU.

[1] https://lore.kernel.org/all/CAJnrk1Y9egYizkx1H9K0cqxSYuB+7vLvQbV7Tf4C5eHFqnnC-A@mail.gmail.com/
[2] https://lore.kernel.org/all/CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com/
[3] https://lore.kernel.org/all/CANaxB-xUrLQYGiRJZc4Boi+KX=0TJSWymErNovANVko20fMDVA@mail.gmail.com/

v1: https://lore.kernel.org/lkml/20260606061031.3744880-1-safinaskar@gmail.com/

Changes since v1: fix fuse-related and CRIU-related regressions (see above).

Askar Safin (7):
  vmsplice: open-code do_writev and do_readv
  vmsplice: change argument type back to "int"
  splice: turn wait_for_space flags argument into bool
  pipe: move wait_for_space to fs/pipe.c and rename it
  vmsplice: make sure we don't wait after writing some data
  vmsplice: return -EINVAL for particular combination of flags
  pipe: set FMODE_NOWAIT for named FIFOs

 fs/pipe.c                 | 23 +++++++++++++
 fs/read_write.c           | 71 +++++++++++++++++++++++++++++++++++----
 fs/splice.c               | 19 +----------
 include/linux/pipe_fs_i.h |  2 ++
 include/linux/syscalls.h  |  2 +-
 5 files changed, 91 insertions(+), 26 deletions(-)

base-commit: 8d86fcfc2857d64af85f5c87c193c25655c970af
-- 
2.47.3


* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: David Hildenbrand (Arm) @ 2026-06-25  8:28 UTC (permalink / raw)
  To: Dev Jain, akpm, ljs
  Cc: riel, liam, vbabka, harry, jannh, kas, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual, stable
In-Reply-To: <614e89a8-5108-4ac1-bdf8-fecca48bd91b@arm.com>

On 6/25/26 10:03, Dev Jain wrote:
> 
> 
> On 25/06/26 1:26 pm, David Hildenbrand (Arm) wrote:
>> On 6/25/26 06:28, Dev Jain wrote:
>>> try_to_unmap_one() handles hugetlb folios when memory failure needs
>>> to replace a poisoned hugetlb mapping with a hwpoison entry. In that
>>> case page_vma_mapped_walk() returns the hugetlb entry in pvmw.pte, but
>>> the code reads it with ptep_get() before decoding the PFN.
>>>
>>> That is wrong on architectures where hugetlb entries are not encoded as
>>> regular PTEs. On s390, for example, a raw huge RSTE must be converted
>>> by huge_ptep_get() before helpers such as pte_pfn() can inspect it. A
>>> raw decode can select the wrong subpage, so try_to_unmap_one() can
>>> install a hwpoison entry for the wrong PFN.
>>>
>>> The userspace-visible result is that a later access to the poisoned
>>> hugetlb subpage can miss the expected SIGBUS. With DEBUG_VM, the wrong
>>> subpage can also trip the PageHWPoison check.
>>>
>>> Use huge_ptep_get() for hugetlb mappings before decoding the PFN.
>>>
>>> Before c7ab0d2fdc84, the bug existed in the form of a plain dereference:
>>> we would check the head page pfn of the hugetlb with pte_pfn(*pte), and
>>> bail out on mismatch. This would mean that the hwpoisoned entry will not
>>> get installed.
>>>
>>> I am not sure what is the procedure on such kinds of very old bugs - how
>>> back should I really go?
>>>
>>> Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> Applies on mm-unstable (d17fe8a046a2).
>>> There are similar old bugs present, in try_to_migrate_one(), check_pte(),
>>> remove_migration_pte(), prot_none_hugetlb_entry().
>>
>> Yeah, we should handle all these cases properly. Can you send fixes?
>>
>> Using ptep_get() on something that's not a PTE entry is shaky on some architectures.
> 
> I can send the fixes blaming the commit till which backport is relatively simple. The bug will
> still remain before that, where we don't even do ptep_get(), just a plain dereference, if
> that is fine. Probably no one is running pre-2017 kernels.

The issue is that we would have to analyze in which cases exactly it would cause
problems, like when migrating prot-none hugetlb folios on s390x, where
pte_present() would not work as expected.

I don't think any of us has time (or motivation) for that detailed analysis to
make some odd hugetlb cases happy.

So I'd say, let's just fix it in a simple way and be done with it. Use
best-effort Fixes: but rather state in the patch description that this was found
by code inspection and that the actual effects are unclear (e.g., pte_present()
misbehaving on s390x), and using huge_ptep_get() is just the right thing to do.
-- 
Cheers,

David



* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Peiyang He @ 2026-06-25  8:09 UTC (permalink / raw)
  To: Qi Zheng, Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, muchun.song, mhocko,
	roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <1db11ccc-ae05-4b26-b360-c34ac9f97299@linux.dev>



On 2026/6/25 15:37, Qi Zheng wrote:
> 
> 
> On 6/25/26 2:32 PM, Harry Yoo wrote:
>>
>>
>> On 6/25/26 3:11 PM, Qi Zheng wrote:
>>> On 6/25/26 12:16 PM, Harry Yoo wrote:
>>>>
>>> [...]
>>>
>>>>
>>>>> So lock_batch_lruvec() can be implemented like this:
>>>>>
>>>>> #ifdef CONFIG_MEMCG
>>>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>> {
>>>>>       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>
>>>>>       rcu_read_lock();
>>>>>
>>>>>       /*
>>>>>        * The memcg can be NULL when the memory controller is disabled.
>>>>>        * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>        */
>>>>>       if (!memcg || !css_is_dying(&memcg->css))
>>>>>           goto lock;
>>>>>
>>>>>       do {
>>>>>           memcg = parent_mem_cgroup(memcg);
>>>>>       } while (memcg && css_is_dying(&memcg->css));
>>>>>       lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> lock:
>>>>>       spin_lock_irq(&lruvec->lru_lock);
>>>>>
>>>>>       return lruvec;
>>>>> }
>>>>> #else
>>>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>> {
>>>>>       lruvec_lock_irq(lruvec);
>>>>>
>>>>>       return lruvec;
>>>>> }
>>>>> #endif
>>>>>
>>>>> Does this make sense?
>>>>
>>>> Yes, looks good to me!
>>>
>>> OK, this sync method makes more sense as it doesn't require adding a
>>> new lrugen->reparente. I'll go with this method and update v3.
>>
>> Thanks!
>>
>> Just one thing to clarify...
>>
>> So, when we check something that's updated _before_ grace period
>> (CSS_DYING), RCU is sufficient.
>>
>> But in folio_lruvec_lock*(), that is not the case because reparenting
>> is performed in the RCU work, under the lruvec lock. So the check needs
>> to be done under RCU and the lruvec lock.
>>
>> This is quite subtle :D
> 
> Indeed.
> 
> And in theory, the l->nr_items check in lock_list_lru_of_memcg() could
> also be replaced by the CSS_DYING check.
> 
>>
>>> Hi Barry and Baolin, what do you think? Since the sync method has been
>>> changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)
>>
>> And hopefully Peiyang would kindly double check v3 still not reproduced
>> on the machine :)
> 
> Yeah!
No problem! I can help test v3.
>>
> 
> 
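
The ancestor walk in the lock_batch_lruvec() proposal quoted above can be
modeled in plain userspace C. This is only a sketch of the control flow:
struct cg_model and first_live_ancestor() are hypothetical names, and the
RCU read section and the lru_lock handling from the quoted code are left
out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: a parent link and a "dying" flag. */
struct cg_model {
	struct cg_model *parent;
	bool dying;
};

/*
 * Return @cg itself when it is NULL or still alive; otherwise walk up
 * the parent chain and return the nearest ancestor that is not dying
 * (NULL if the chain runs out).
 */
const struct cg_model *first_live_ancestor(const struct cg_model *cg)
{
	if (!cg || !cg->dying)
		return cg;

	do {
		cg = cg->parent;
	} while (cg && cg->dying);

	return cg;
}
```

With a chain root <- a <- b where a and b are dying, the walk starting at
b ends at root, which mirrors how the quoted code picks the lruvec to lock.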





* Re: [PATCH v8 39/46] KVM: selftests: Test conversion with elevated page refcount
From: Fuad Tabba @ 2026-06-25  8:04 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-39-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add a selftest to verify that converting a shared guest_memfd page to a
> private page fails if the page has an elevated reference count.
>
> When KVM converts a shared page to a private one, it expects the page to
> have a reference count equal to the reference counts taken by the
> filemap. If another kernel subsystem holds a reference to the page, the
> conversion must be aborted.
>
> The test asserts that both bulk and single-page conversion attempts
> correctly fail with EAGAIN for the pinned page. After the page is unpinned,
> the test verifies that subsequent conversions succeed.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Not sure Sashiko's concern is worth it.

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  .../kvm/x86/guest_memfd_conversions_test.c         | 56 ++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index 99b0023609670..4ebbd29029526 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -441,6 +441,62 @@ GMEM_CONVERSION_TEST_INIT_SHARED(forked_accesses)
>  #undef TEST_STATE_AWAIT
>  }
>
> +static void test_convert_to_private_fails(test_data_t *t, u64 pgoff,
> +                                         size_t nr_pages,
> +                                         u64 expected_error_offset)
> +{
> +       /* +1 to make it anything but expected_error_offset. */
> +       u64 error_offset = expected_error_offset + 1;
> +       u64 offset = pgoff * page_size;
> +       int ret;
> +
> +       do {
> +               ret = __gmem_set_private(t->gmem_fd, offset,
> +                                        nr_pages * page_size, &error_offset);
> +       } while (ret == -1 && errno == EINTR);
> +       TEST_ASSERT(ret == -1 && errno == EAGAIN,
> +                   "Wanted EAGAIN on page %lu, got %d (ret = %d)", pgoff,
> +                   errno, ret);
> +       TEST_ASSERT_EQ(error_offset, expected_error_offset);
> +}
> +
> +GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(elevated_refcount, 4)
> +{
> +       int i;
> +
> +       pin_pages(t->mem + test_page * page_size, page_size);
> +
> +       for (i = 0; i < nr_pages; i++)
> +               test_shared(t, i, 0, 'A', 'B');
> +
> +       /*
> +        * Converting in bulk should fail as long as any page in the range has
> +        * unexpected refcounts.
> +        */
> +       test_convert_to_private_fails(t, 0, nr_pages, test_page * page_size);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               /*
> +                * Converting page-wise should also fail as long as any page in the
> +                * range has unexpected refcounts.
> +                */
> +               if (i == test_page)
> +                       test_convert_to_private_fails(t, i, 1, test_page * page_size);
> +               else
> +                       test_convert_to_private(t, i, 'B', 'C');
> +       }
> +
> +       unpin_pages();
> +
> +       gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               char expected = i == test_page ? 'B' : 'C';
> +
> +               test_private(t, i, expected, 'D');
> +       }
> +}
> +
>  int main(int argc, char *argv[])
>  {
>         TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: Dev Jain @ 2026-06-25  8:03 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, ljs
  Cc: riel, liam, vbabka, harry, jannh, kas, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual, stable
In-Reply-To: <ca31c254-504f-4857-bec7-10b8c2de94ed@kernel.org>



On 25/06/26 1:26 pm, David Hildenbrand (Arm) wrote:
> On 6/25/26 06:28, Dev Jain wrote:
>> try_to_unmap_one() handles hugetlb folios when memory failure needs
>> to replace a poisoned hugetlb mapping with a hwpoison entry. In that
>> case page_vma_mapped_walk() returns the hugetlb entry in pvmw.pte, but
>> the code reads it with ptep_get() before decoding the PFN.
>>
>> That is wrong on architectures where hugetlb entries are not encoded as
>> regular PTEs. On s390, for example, a raw huge RSTE must be converted
>> by huge_ptep_get() before helpers such as pte_pfn() can inspect it. A
>> raw decode can select the wrong subpage, so try_to_unmap_one() can
>> install a hwpoison entry for the wrong PFN.
>>
>> The userspace-visible result is that a later access to the poisoned
>> hugetlb subpage can miss the expected SIGBUS. With DEBUG_VM, the wrong
>> subpage can also trip the PageHWPoison check.
>>
>> Use huge_ptep_get() for hugetlb mappings before decoding the PFN.
>>
>> Before c7ab0d2fdc84, the bug existed in the form of a plain dereference:
>> we would check the head page pfn of the hugetlb with pte_pfn(*pte), and
>> bail out on mismatch. This would mean that the hwpoisoned entry will not
>> get installed.
>>
>> I am not sure what is the procedure on such kinds of very old bugs - how
>> back should I really go?
>>
>> Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Applies on mm-unstable (d17fe8a046a2).
>> There are similar old bugs present, in try_to_migrate_one(), check_pte(),
>> remove_migration_pte(), prot_none_hugetlb_entry().
> 
> Yeah, we should handle all these cases properly. Can you send fixes?
> 
> Using ptep_get() on something that's not a PTE entry is shaky on some architectures.

I can send the fixes blaming the commit till which backport is relatively simple. The bug will
still remain before that, where we don't even do ptep_get(), just a plain dereference, if
that is fine. Probably no one is running pre-2017 kernels.

> 




* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Alexandre Ghiti @ 2026-06-25  7:59 UTC (permalink / raw)
  To: Nhat Pham, Kairui Song
  Cc: akpm, hannes, yosry, chengming.zhou, david, ljs, liam, vbabka,
	rppt, surenb, mhocko, chrisl, baohua, usama.arif, linux-mm,
	linux-kernel
In-Reply-To: <CAKEwX=M4A4kAFS-Fb=toAK7e=mseVc191_2r7vTrMiJnyT9+wA@mail.gmail.com>

Hi Nhat,

On 6/24/26 19:43, Nhat Pham wrote:
> On Wed, Jun 24, 2026 at 3:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> Better check zswap_never_enabled first to avoid a xa_load if not needed.
> +1.
>
> Maybe also xa_empty() when we're at it? :)


Yes I'll add that too.

Thanks,

Alex




* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: David Hildenbrand (Arm) @ 2026-06-25  7:56 UTC (permalink / raw)
  To: Dev Jain, akpm, ljs
  Cc: riel, liam, vbabka, harry, jannh, kas, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual, stable
In-Reply-To: <20260625042853.2752898-1-dev.jain@arm.com>

On 6/25/26 06:28, Dev Jain wrote:
> try_to_unmap_one() handles hugetlb folios when memory failure needs
> to replace a poisoned hugetlb mapping with a hwpoison entry. In that
> case page_vma_mapped_walk() returns the hugetlb entry in pvmw.pte, but
> the code reads it with ptep_get() before decoding the PFN.
> 
> That is wrong on architectures where hugetlb entries are not encoded as
> regular PTEs. On s390, for example, a raw huge RSTE must be converted
> by huge_ptep_get() before helpers such as pte_pfn() can inspect it. A
> raw decode can select the wrong subpage, so try_to_unmap_one() can
> install a hwpoison entry for the wrong PFN.
> 
> The userspace-visible result is that a later access to the poisoned
> hugetlb subpage can miss the expected SIGBUS. With DEBUG_VM, the wrong
> subpage can also trip the PageHWPoison check.
> 
> Use huge_ptep_get() for hugetlb mappings before decoding the PFN.
> 
> Before c7ab0d2fdc84, the bug existed in the form of a plain dereference:
> we would check the head page pfn of the hugetlb with pte_pfn(*pte), and
> bail out on mismatch. This would mean that the hwpoisoned entry will not
> get installed.
> 
> I am not sure what is the procedure on such kinds of very old bugs - how
> back should I really go?
> 
> Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Applies on mm-unstable (d17fe8a046a2).
> There are similar old bugs present, in try_to_migrate_one(), check_pte(),
> remove_migration_pte(), prot_none_hugetlb_entry().

Yeah, we should handle all these cases properly. Can you send fixes?

Using ptep_get() on something that's not a PTE entry is shaky on some architectures.

-- 
Cheers,

David



* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: David Hildenbrand (Arm) @ 2026-06-25  7:54 UTC (permalink / raw)
  To: Dev Jain, kernel test robot, akpm, ljs
  Cc: llvm, oe-kbuild-all, riel, liam, vbabka, harry, jannh, kas,
	linux-mm, linux-kernel, ryan.roberts, anshuman.khandual, stable
In-Reply-To: <2fd3688b-b4d0-4a2f-8d49-4d4b9c512c66@arm.com>

On 6/25/26 08:59, Dev Jain wrote:
> 
> 
> On 25/06/26 11:15 am, kernel test robot wrote:
>> Hi Dev,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-use-huge_ptep_get-in-try_to_unmap_one/20260625-123050
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link:    https://lore.kernel.org/r/20260625042853.2752898-1-dev.jain%40arm.com
>> patch subject: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
>> config: hexagon-allnoconfig (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/config)
>> compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 6cc609bb250b21b47fc7d394b4019101e9983597)
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260625/202606251341.jfIr1D7m-lkp@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202606251341.jfIr1D7m-lkp@intel.com/
>>
>> All errors (new ones prefixed by >>):
>>
>>     2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
>>          |                                  ^
>>     2100 |                         pteval = huge_ptep_get(mm, address, pvmw.pte);
>>          |                                ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    2 errors generated.
> 
> Weird that I need a stub. This should do:
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2abaf99321e90..4661f88eee55b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1261,6 +1261,16 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
>  {
>  }
> 
> +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr,
> +				  pte_t *ptep)
> +{
> +#ifdef CONFIG_MMU
> +	return ptep_get(ptep);
> +#else
> +	return *ptep;
> +#endif

Without CONFIG_HUGETLB_PAGE, folio_test_hugetlb() == false and the compiler will
never end up actually linking this function.

So probably you can just let the linker deal with that

pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);


If abused, the linker would complain. In your case, the compiler will optimize it
out (and be happy) and the linker will never start looking for the symbol (that
doesn't exist).
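
A minimal userspace sketch of the declaration-only pattern described
above. All names are hypothetical: FOLIO_IS_HUGETLB stands in for
folio_test_hugetlb() in a build without CONFIG_HUGETLB_PAGE, and
slow_path_get() stands in for huge_ptep_get(). The sketch uses a literal
constant so that the guarded branch is trivially dead.

```c
/*
 * Stand-in for folio_test_hugetlb() when hugetlb support is compiled
 * out: a compile-time constant, so the guarded branch is provably dead.
 */
#define FOLIO_IS_HUGETLB 0

/* Declared but intentionally never defined anywhere. */
int slow_path_get(const int *entry);

/*
 * Read an entry. The call to slow_path_get() sits in a dead branch, so
 * the compiler drops it and no reference to the missing symbol ever
 * reaches the linker.
 */
int read_entry(const int *entry)
{
	if (FOLIO_IS_HUGETLB)
		return slow_path_get(entry);

	return *entry;
}
```

If FOLIO_IS_HUGETLB were changed to 1 without providing a definition, the
link step would fail with an undefined reference, which is the "linker
would complain" case.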

-- 
Cheers,

David



* Re: [RFC PATCH] mm: Avoiding split large folios if swap has no space
From: David Hildenbrand (Arm) @ 2026-06-25  7:49 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, axelrasmussen, baolin.wang, dev.jain, kasong, lance.yang,
	liam, linux-kernel, linux-mm, ljs, npache, qi.zheng, ryan.roberts,
	shakeel.butt, weixugc, yuanchu, zhaonanzhe, ziy, Johannes Weiner,
	Michal Hocko, Roman Gushchin
In-Reply-To: <CAGsJ_4xR0Ge707s8JAfd5Fw10+U+25_nuBGuVayx9qnxR+NnGQ@mail.gmail.com>

>>
>> But now I wonder whether we would also want to check "is there any free swap
>> space", not just "is there any swap".
> 
> I don't quite understand you. get_nr_swap_pages() returns
> nr_swap_pages, which increases or decreases as swap is allocated or
> freed. I guess it just reflects how many swaps we currently have
> available?

Indeed, I was confused by the function name; it really is "free swap pages". So all good :)

> 
>>
>>
>> Essentially, try returning -E2BIG if there is the chance to swap out after
>> split, and  -ENOSPC / -ENOMEM if a split wouldn't help.
>>
>>>       }
>>>
>>>  again:
>>> @@ -1769,11 +1772,13 @@ int folio_alloc_swap(struct folio *folio)
>>>       }
>>>
>>>       /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
>>> -     if (unlikely(mem_cgroup_try_charge_swap(folio)))
>>> +     if (unlikely(mem_cgroup_try_charge_swap(folio))) {
>>>               swap_cache_del_folio(folio);
>>> +             return -ENOMEM;
>>
>> Here we wouldn't have the information whether we could charge after a split.
>>
>> So that would require a rework to signal this more cleanly to the caller.
> 
> Yep. The tricky part is that mem_cgroup_try_charge_swap() cannot
> return how much swap quota is available in the memcg. Do you prefer to
> add an output argument to mem_cgroup_try_charge_swap() to expose
> that
That would probably be cleanest, if that is easily possible. We would want to
get memcg maintainer feedback on that.

@memcg folks: we'd like to know whether splitting a large folio would make
mem_cgroup_try_charge_swap() succeed on a split (smaller) part, to distinguish
"there is no way we can swap out anything, don't split" vs. "we could swap out,
split".

What's the best way to obtain that information?
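
To make the question concrete, here is a small userspace model of the
decision being discussed. It is only a sketch under stated assumptions:
swap_alloc_verdict() is a hypothetical helper, and memcg_headroom_pages
is exactly the information that mem_cgroup_try_charge_swap() does not
expose today. Cluster and fragmentation effects are ignored.

```c
#include <errno.h>

/*
 * Hypothetical model: decide what folio_alloc_swap() could report.
 *
 *   0        the whole folio fits, no split needed
 *   -E2BIG   the folio does not fit as a whole, but a smaller part
 *            could be swapped out, so splitting may help
 *   -ENOSPC  no free swap at all, splitting would not help
 *   -ENOMEM  the memcg cannot charge any swap, splitting would not help
 */
long swap_alloc_verdict(long free_swap_pages, long memcg_headroom_pages,
			long folio_pages)
{
	if (free_swap_pages <= 0)
		return -ENOSPC;
	if (memcg_headroom_pages <= 0)
		return -ENOMEM;
	if (folio_pages <= free_swap_pages &&
	    folio_pages <= memcg_headroom_pages)
		return 0;

	return -E2BIG;
}
```

A caller would then split only on -E2BIG and skip the split on the other
two errors.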

-- 
Cheers,

David



* Re: [PATCH v10 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering
From: Ye Liu @ 2026-06-25  7:48 UTC (permalink / raw)
  To: Andrew Morton, Zhen Ni
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	linux-kernel
In-Reply-To: <20260624201526.d96ff165f09f361ad128cea0@linux-foundation.org>



On 2026/6/25 11:15, Andrew Morton wrote:
> On Thu, 18 Jun 2026 11:57:46 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
> 
>> This patch series introduces per-file-descriptor filtering capabilities to the
>> page_owner feature.
> 
> Thanks.  Ye Liu has been working on page_owner recently
> (https://lore.kernel.org/20260623065234.31866-2-ye.liu@linux.dev
> https://lore.kernel.org/20260625014708.87386-1-ye.liu@linux.dev), so
> perhaps he can help review these changes?
> 

Hi Andrew, Zhen
 
Thanks for the heads-up. I'll take a look at Zhen Ni's series and see if
I can provide some review comments. I'm also aware of the Sashiko report
– I'll check those points and discuss with Zhen if needed.
 
I'll try to get back with feedback within a few days.

By the way, at first glance this series seems similar in functionality to
commit 156c0c5d1463 ("mm/page_owner: introduce struct stack_print_ctx").

> Sashiko found a few things to ask about:
> 	https://sashiko.dev/#/patchset/20260618035750.3724613-1-zhen.ni@easystack.cn
> 

-- 
Thanks,
Ye Liu




* Re: [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
From: Lorenzo Stoakes @ 2026-06-25  7:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team
In-Reply-To: <b0d9ce70-ebb7-4b79-9a35-257b69daff7a@kernel.org>

(Note that it's the merge window. Ideally you should hold off on sending
non-RFC series until after the merge window closes; none of these series will be
taken yet and you'll have to resend anyway.)

On Thu, Jun 25, 2026 at 08:32:43AM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 03:50, Rik van Riel wrote:
> > Sometimes processes can get stuck with the mmap_lock held for
> > a long time. This slows down, and can even prevent system monitoring
> > tools from assessing and logging the situation, because they themselves
> > end up getting stuck on the mmap_lock.
> >
> > However, with the introduction of per-VMA locks, we can improve the
> > reliability of system monitoring, and generally speed up __access_remote_vm
> > under mmap_lock contention, by adding a fast path that does not require
> > the process-wide mmap_lock.
> >
> > This fast path is only compiled in and used when it is safe to do so,
> > meaning a kernel with per-VMA locks, RCU page table freeing, the VMA
> > is not hugetlbfs, iomap, pfnmap, etc...
> >
> > v2:
> >  - simplify the code, which should be ok because these copies are < PAGE_SIZE
> >  - clean up the code
> >  - fix locking wrt tlb_remove_table_sync_one()
> >  - hopefully address all the other comments

This is really not sufficient :) you should break out the changes you
make and who suggested them.

If people give their time to review, you should take the time to make it clear
you've dealt with it.

>
> You mean, ignoring my comments about not reimplementing GUP entirely?
>
> NAK

Yeah agreed.

As replied elsewhere, I don't think we need to do that anyway?

But you _really_ need to respond to review inline so we can have those
discussions and ensure that you're moving forward in the correct way.

Just quietly resending with a catch-all comment as per above puts all the load
on us to go and check that you've done what we've asked.

On my side, my review time is very limited now, so triage dictates general
dismissals when the series looks wrong. Engaging in discussion will help us all
move forward efficiently.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Alexandre Ghiti @ 2026-06-25  7:42 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, hannes, nphamcs, chengming.zhou, david, ljs, liam, vbabka,
	rppt, surenb, mhocko, kasong, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel
In-Reply-To: <CAO9r8zNni28A+9BiDgDTBRuvGTKnqH91kpbpjjeftfnKSeSOJg@mail.gmail.com>

Hi Yosry,

On 6/24/26 20:01, Yosry Ahmed wrote:
> On Wed, Jun 24, 2026 at 12:57 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
>> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>>
>> zswap is the same kind of in-memory, synchronous backend as zram, not a
>> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
>> swapin_readahead().
>>
>> Here are the results from bypassing readahead for zswap too: it was
>> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
>> off, on Sapphire Rapids and 3 iterations.
>>
>>    768M memcg (sustained swap thrash):
>>      metric                 mm-new    + bypass    delta
>>      build time (s)          405.0       341.7    -15.6%
>>      zswap-in (GB)            79.5        53.0     -33%
>>      zswap-out (GB)          144.8       115.6     -20%
>>      swap readahead (pages)  6.79M       0.45M     -93%
>>      swap_ra hit (%)          72.1        89.9     +18pp
>>
>>    1G memcg (light pressure, build not memory-bound):
>>      metric                 mm-new    + bypass    delta
>>      build time (s)          177.7       176.0    ~same (no regression)
>>      zswap-in (GB)            10.2         7.5     -26%
>>      zswap-out (GB)           27.7        25.1      -9%
>>      swap readahead (pages)  1.07M       0.08M     -93%
>>      swap_ra hit (%)          68.6        87.2     +19pp
>>
>> The gain is from no longer prefetching pages that are pointless for an
>> in-memory backend: readahead inflates anon residency and thrashes the
>> page cache (file pages get evicted and re-read), lengthens each fault by
>> synchronously (de)compressing a cluster of neighbours, and adds
>> compression traffic when those extra pages are reclaimed.
>>
>> Bypassing swap readahead for zswap therefore makes sense.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>
>> - This bypass originally comes from Usama's series that implements
>>    large folio zswapin: while working on improving this series, I noticed
>>    the gains I got only came from the bypass of readahead.
>>
>>   include/linux/zswap.h |  6 ++++++
>>   mm/memory.c           |  5 +++--
>>   mm/zswap.c            | 11 +++++++++++
>>   3 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
>> index 30c193a1207e..b6f0e6198b6f 100644
>> --- a/include/linux/zswap.h
>> +++ b/include/linux/zswap.h
>> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>>   void zswap_folio_swapin(struct folio *folio);
>>   bool zswap_is_enabled(void);
>>   bool zswap_never_enabled(void);
>> +bool zswap_present_test(swp_entry_t swp);
>>   #else
>>
>>   struct zswap_lruvec_state {};
>> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>>          return true;
>>   }
>>
>> +static inline bool zswap_present_test(swp_entry_t swp)
>> +{
>> +       return false;
>> +}
>> +
>>   #endif
>>
>>   #endif /* _LINUX_ZSWAP_H */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ff338c2abe92..5aa1ea9eb48a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>          if (folio)
>>                  swap_update_readahead(folio, vma, vmf->address);
>>          if (!folio) {
>> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
>> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
>> +               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
>> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
>> +                   zswap_present_test(entry))
> This assumes that if the swap entry is in zswap, then the remaining
> entries (covered by the readahead window) will also be in zswap,
> right? While not very likely, it's possible that the remaining entries
> are not in zswap but on disk, right?


Yes, I assumed locality here.

Indeed, it would be interesting to keep readahead but only actually read
ahead entries that are on the swap disk. I don't know how this will affect
the readahead window (it is computed from the number of PG_readahead hits,
iirc), but I can give it a try.
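
To make the idea concrete, here is a rough userspace sketch of what
filtering the readahead window could look like. It is not kernel code:
pick_disk_readahead() and the in_zswap[] array are made up for
illustration, and it assumes an aligned window of 'win' slots plus a
per-offset "is this entry in zswap" lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only: in_zswap[i] stands in for "a zswap tree lookup
 * found an entry for swap offset i". All names here are hypothetical.
 * Returns how many offsets were stored in out[].
 */
static size_t pick_disk_readahead(const bool *in_zswap, size_t nr_slots,
				  size_t fault_off, size_t win,
				  size_t *out, size_t out_cap)
{
	size_t start, end, off, n = 0;

	assert(win > 0 && fault_off < nr_slots);

	/* Window aligned on 'win', clamped to the end of the swap area. */
	start = (fault_off / win) * win;
	end = start + win < nr_slots ? start + win : nr_slots;

	for (off = start; off < end && n < out_cap; off++) {
		if (off == fault_off)
			continue;	/* the faulting entry is read synchronously */
		if (in_zswap[off])
			continue;	/* already in memory, prefetch buys nothing */
		out[n++] = off;
	}
	return n;
}
```

With this shape, a window that is entirely in zswap degenerates to the
bypass from the patch, while a mixed window still prefetches the
disk-backed neighbours.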


>
>>                          folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>>                                              thp_swapin_suitable_orders(vmf) | BIT(0),
>>                                              vmf, NULL, 0);
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 761cd699e0a3..5b85b4d17647 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>>                  >> ZSWAP_ADDRESS_SPACE_SHIFT];
>>   }
>>
>> +/**
>> + * zswap_present_test - check if a swap entry is currently backed by zswap
>> + * @swp: the swap entry to test
>> + *
>> + * Return: true if @swp has a zswap entry, false otherwise.
>> + */
>> +bool zswap_present_test(swp_entry_t swp)
> zswap_is_present()?


Agreed, the naming is not perfect. I'll rename it, either to your
proposal or to zswap_is_entry_present() (or something else).

Thanks,

Alex


>
>> +{
>> +       return xa_load(swap_zswap_tree(swp), swp_offset(swp));
>> +}
>> +
>>   #define zswap_pool_debug(msg, p)                       \
>>          pr_debug("%s pool %s\n", msg, (p)->tfm_name)
>>
>> --
>> 2.54.0
>>
>>


^ permalink raw reply

* Re: [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs
From: David Hildenbrand (Arm) @ 2026-06-25  7:41 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <ajwpCOSGapenRPsu@gourry-fedora-PF4VCD3F>

On 6/24/26 20:59, Gregory Price wrote:
> On Wed, Jun 24, 2026 at 10:57:35AM -0400, Gregory Price wrote:
>> ... snip ...
> 
> Disregard, there are a few unaddressed Sashiko comments, I'm just going
> to respin this.  Will wait until after the merge window closes for v6.
> 
> The rough shape of things should still hold w/ prior feedback.

Added some comments :)

-- 
Cheers,

David


^ permalink raw reply

* Re: [PATCH v8 38/46] KVM: selftests: Add helpers to pin pages with CONFIG_GUP_TEST
From: Fuad Tabba @ 2026-06-25  7:40 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-38-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add helper functions to allow KVM selftests to pin memory using
> CONFIG_GUP_TEST. This is useful for testing scenarios where some page has
> an increased refcount, such as in guest_memfd in-place conversion tests.
>
> The helpers open /sys/kernel/debug/gup_test and invoke the
> PIN_LONGTERM_TEST_START and PIN_LONGTERM_TEST_STOP ioctls. Since this
> functionality depends on the kernel being built with CONFIG_GUP_TEST,
> provide stub implementations that trigger a test failure if the
> configuration is missing.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

nit below, otherwise:

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  tools/testing/selftests/kvm/include/kvm_util.h |  3 +++
>  tools/testing/selftests/kvm/lib/kvm_util.c     | 23 +++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 323d06b5699ec..79ab64ac8b869 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -1195,6 +1195,9 @@ static inline int pin_self_to_any_cpu(void)
>         return pin_task_to_any_cpu(pthread_self());
>  }
>
> +void pin_pages(void *vaddr, uint64_t size);
> +void unpin_pages(void);
> +
>  void kvm_print_vcpu_pinning_help(void);
>  void kvm_parse_vcpu_pinning(const char *pcpus_string, u32 vcpu_to_pcpu[],
>                             int nr_vcpus);
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index b73817f7bc803..524ef97d634bf 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -18,6 +18,8 @@
>  #include <unistd.h>
>  #include <linux/kernel.h>
>
> +#include "../../../../mm/gup_test.h"
> +
>  #define KVM_UTIL_MIN_PFN       2
>
>  u32 guest_random_seed;
> @@ -639,6 +641,27 @@ int __pin_task_to_cpu(pthread_t task, int cpu)
>         return pthread_setaffinity_np(task, sizeof(cpuset), &cpuset);
>  }
>
> +static int gup_test_fd = -1;
> +
> +void pin_pages(void *vaddr, uint64_t size)
> +{
> +       const struct pin_longterm_test args = {
> +               .addr = (uint64_t)vaddr,
> +               .size = size,
> +               .flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
> +       };
> +
> +       gup_test_fd = __open_path_or_exit("/sys/kernel/debug/gup_test", O_RDWR,
> +                                         "Is CONFIG_GUP_TEST enabled?");

nit: should you close this/reset it to -1 after the tests?
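
Something along these lines, as a sketch only: close_and_reset_fd() is a
made-up helper name, not part of the patch, and I am assuming
unpin_pages() would call it on gup_test_fd after the
PIN_LONGTERM_TEST_STOP ioctl.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper, not part of the patch: close *fd and mark it unused. */
static void close_and_reset_fd(int *fd)
{
	int ret;

	if (*fd < 0)
		return;		/* nothing open, so calling twice is harmless */

	ret = close(*fd);
	assert(ret == 0);	/* a failing close() here points at a caller bug */
	(void)ret;		/* keep NDEBUG builds free of unused warnings */
	*fd = -1;
}
```

That way a second pin_pages()/unpin_pages() pair starts from a clean
state instead of leaking the previous descriptor.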

> +
> +       TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
> +}
> +
> +void unpin_pages(void)
> +{
> +       TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
> +}
> +
>  static u32 parse_pcpu(const char *cpu_str, const cpu_set_t *allowed_mask)
>  {
>         u32 pcpu = atoi_non_negative("CPU number", cpu_str);
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>


^ permalink raw reply

* Re: [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
From: David Hildenbrand (Arm) @ 2026-06-25  7:40 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple, Hannes Reinecke
In-Reply-To: <20260624145744.3532049-9-gourry@gourry.net>

On 6/24/26 16:57, Gregory Price wrote:
> There is no atomic mechanism to offline and remove an entire
> multi-block DAX kmem device.  This is presently done in two steps:
>     1. offline all
>     2. remove all
> 
> This creates a race condition where another entity operates directly
> on the memory blocks and can cause hot-unplug to fail / unbind to
> deadlock.
> 
> Add a new 'state' sysfs attribute that enables an atomic whole-device
> hotplug operation across its entire memory region.
> 
> daxX.Y/state mirrors the per-block memoryX/state ABI:
>   - [offline, online, online_kernel, online_movable]
>   - "unplugged" - is added specifically for dax0.0/state
> 
> The valid writable states include:
>   - "unplugged":      memory blocks are not present
>   - "online":         memory is online, zone chosen by the kernel
>   - "online_kernel":  memory is online in ZONE_NORMAL
>   - "online_movable": memory is online in ZONE_MOVABLE
> 
> Valid transitions:
>   - unplugged                -> online[_kernel|_movable]
>   - online[_kernel|_movable] -> unplugged
>   - offline                  -> unplugged
> 
> A device can only be onlined from "unplugged", so it must be returned
> there before being onlined into a different state.
> 
> For backwards compatibility the memory blocks are always created at
> probe - existing tools expect them to be present after kmem binds.
> 
> "offline" is therefore a reportable state but is not writable: it only
> arises from the legacy auto_online_blocks=offline policy.  Onlining
> such a device through this attribute requires unplugging it first in
> an effort to get drivers creating DAX devices to set a default.
> 
> Unplug is atomic across the whole device: dax_kmem_do_hotremove()
> collects every added range and offlines/removes them in one operation.
> Either the operation succeeds or is entirely rolled back.
> 
> Unbind Note:
>   We used to call remove_memory() during unbind, which would fire a
>   BUG() if any of the memory blocks were online at that time.  We lift
>   this into a WARN in the cleanup routine and don't attempt hotremove
>   if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.
> 
>   The memory of an offline dax device is removed on unbind, as before.
> 
>   If online at unbind, the resources are leaked (as before), but now
>   we prevent deadlock if a memory region is impossible to hotremove.
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  Documentation/ABI/testing/sysfs-bus-dax |  26 +++
>  drivers/base/memory.c                   |   9 +

Can we have this ...

>  drivers/dax/kmem.c                      | 224 ++++++++++++++++++++----
>  include/linux/memory_hotplug.h          |   1 +
> 

... and this as a separate patch, please?

Nothing else jumped at me.
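
For reference, this is how the transition rules in the description read
when written down as a check. It is a userspace sketch with made-up
names (enum dax_state_model, dax_state_write_ok()), not the code in
drivers/dax/kmem.c, and it assumes same-state writes are rejected, which
the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the patch's own enum and helpers may differ. */
enum dax_state_model {
	ST_UNPLUGGED,
	ST_OFFLINE,
	ST_ONLINE,
	ST_ONLINE_KERNEL,
	ST_ONLINE_MOVABLE,
};

static bool st_is_online(enum dax_state_model s)
{
	return s == ST_ONLINE || s == ST_ONLINE_KERNEL ||
	       s == ST_ONLINE_MOVABLE;
}

/* True if writing 'to' is accepted while the device is in state 'from'. */
static bool dax_state_write_ok(enum dax_state_model from,
			       enum dax_state_model to)
{
	assert(from <= ST_ONLINE_MOVABLE && to <= ST_ONLINE_MOVABLE);

	if (to == ST_OFFLINE)
		return false;	/* reportable, never writable */
	if (to == ST_UNPLUGGED)
		return st_is_online(from) || from == ST_OFFLINE;
	/* Any online variant can only be reached from "unplugged". */
	return from == ST_UNPLUGGED;
}
```

In this model, moving from "online" to "online_movable" takes two
writes: "unplugged" first, then "online_movable".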

-- 
Cheers,

David


^ permalink raw reply

* [PATCH 2/2] mm/mm_init: drop overlap_memmap_init()
From: Mike Rapoport @ 2026-06-25  7:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Rapoport, Taku Izumi,
	Wei Yang, Yuan Liu, linux-kernel
In-Reply-To: <20260625073941.145014-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

When ZONE_NORMAL and ZONE_MOVABLE could overlap because kernelcore=mirror
didn't reduce the span of ZONE_NORMAL, initialization of the memory map had
to skip overlapping pages during initialization of ZONE_MOVABLE to avoid
double initialization of the same struct pages.

Since kernelcore=mirror now works the same way as other variants of
kernelcore=/movablecore= and adjusts the span of ZONE_NORMAL, there can't
be an overlap between ZONE_NORMAL and ZONE_MOVABLE.

Remove overlap_memmap_init().

Co-developed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/mm_init.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index dce9dc9f2302..6f0a71ccca30 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -783,28 +783,6 @@ void __meminit init_deferred_page(unsigned long pfn, int nid)
 	__init_deferred_page(pfn, nid);
 }
 
-/* If zone is ZONE_MOVABLE but memory is mirrored, it is an overlapped init */
-static bool __meminit
-overlap_memmap_init(unsigned long zone, unsigned long *pfn)
-{
-	static struct memblock_region *r __meminitdata;
-
-	if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
-		if (!r || *pfn >= memblock_region_memory_end_pfn(r)) {
-			for_each_mem_region(r) {
-				if (*pfn < memblock_region_memory_end_pfn(r))
-					break;
-			}
-		}
-		if (*pfn >= memblock_region_memory_base_pfn(r) &&
-		    memblock_is_mirror(r)) {
-			*pfn = memblock_region_memory_end_pfn(r);
-			return true;
-		}
-	}
-	return false;
-}
-
 /*
  * Only struct pages that correspond to ranges defined by memblock.memory
  * are zeroed and initialized by going through __init_single_page() during
@@ -891,8 +869,6 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 		 * function.  They do not exist on hotplugged memory.
 		 */
 		if (context == MEMINIT_EARLY) {
-			if (overlap_memmap_init(zone, &pfn))
-				continue;
 			if (defer_init(nid, pfn, zone_end_pfn)) {
 				deferred_struct_pages = true;
 				break;
-- 
2.53.0



^ permalink raw reply related

* [PATCH 1/2] mm/mm_init: don't overlap NORMAL and MOVABLE zones with kernelcore=mirror
From: Mike Rapoport @ 2026-06-25  7:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Rapoport, Taku Izumi,
	Wei Yang, Yuan Liu, linux-kernel
In-Reply-To: <20260625073941.145014-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

When the kernelcore or movablecore kernel parameters define the sizes of
the NORMAL and MOVABLE zones as a percentage of total memory or as an
absolute value, ZONE_NORMAL is clamped at the beginning of ZONE_MOVABLE.

However, with kernelcore=mirror the ZONE_NORMAL span is not changed;
instead, pages from ZONE_MOVABLE are counted as absent in ZONE_NORMAL.

Make the behaviour of kernelcore= parameter uniform and treat mirror
just as another way to size the zones.

Co-developed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/mm_init.c | 36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..dce9dc9f2302 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1159,9 +1159,8 @@ static void __init adjust_zone_range_for_zone_movable(int nid,
 				arch_zone_highest_possible_pfn[movable_zone]);
 
 		/* Adjust for ZONE_MOVABLE starting within this range */
-		} else if (!mirrored_kernelcore &&
-			*zone_start_pfn < zone_movable_pfn[nid] &&
-			*zone_end_pfn > zone_movable_pfn[nid]) {
+		} else if (*zone_start_pfn < zone_movable_pfn[nid] &&
+			   *zone_end_pfn > zone_movable_pfn[nid]) {
 			*zone_end_pfn = zone_movable_pfn[nid];
 
 		/* Check if this whole range is within ZONE_MOVABLE */
@@ -1209,40 +1208,11 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
 					unsigned long zone_start_pfn,
 					unsigned long zone_end_pfn)
 {
-	unsigned long nr_absent;
-
 	/* zone is empty, we don't have any absent pages */
 	if (zone_start_pfn == zone_end_pfn)
 		return 0;
 
-	nr_absent = __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
-
-	/*
-	 * ZONE_MOVABLE handling.
-	 * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
-	 * and vice versa.
-	 */
-	if (mirrored_kernelcore && zone_movable_pfn[nid]) {
-		unsigned long start_pfn, end_pfn;
-		struct memblock_region *r;
-
-		for_each_mem_region(r) {
-			start_pfn = clamp(memblock_region_memory_base_pfn(r),
-					  zone_start_pfn, zone_end_pfn);
-			end_pfn = clamp(memblock_region_memory_end_pfn(r),
-					zone_start_pfn, zone_end_pfn);
-
-			if (zone_type == ZONE_MOVABLE &&
-			    memblock_is_mirror(r))
-				nr_absent += end_pfn - start_pfn;
-
-			if (zone_type == ZONE_NORMAL &&
-			    !memblock_is_mirror(r))
-				nr_absent += end_pfn - start_pfn;
-		}
-	}
-
-	return nr_absent;
+	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
 /*
-- 
2.53.0



^ permalink raw reply related

* [PATCH 0/2] mm/mm_init: don't overlap zones with kernelcore=mirror
From: Mike Rapoport @ 2026-06-25  7:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Rapoport, Taku Izumi,
	Wei Yang, Yuan Liu, linux-kernel

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

These patches make the behaviour of the kernelcore= parameter uniform,
treat mirror as just another way to size the zones, and clean up a weird
part of the memory map initialization.
   
Mike Rapoport (Microsoft) (2):
  mm/mm_init: don't overlap NORMAL and MOVABLE zones with kernelcore=mirror
  mm/mm_init: drop overlap_memmap_init()

 mm/mm_init.c | 60 +++-------------------------------------------------
 1 file changed, 3 insertions(+), 57 deletions(-)


base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
-- 
2.53.0



^ permalink raw reply

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
From: Lorenzo Stoakes @ 2026-06-25  7:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-4-riel@surriel.com>

On Wed, Jun 24, 2026 at 09:50:53PM -0400, Rik van Riel wrote:
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common case
> of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path.  It takes the per-VMA lock with
> lock_vma_under_rcu() and, only when the whole request lies within that one
> VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
> to grab a short-lived page reference from a page table walk run with
> interrupts disabled.  Interrupts are disabled only across the walk (until
> the folio is pinned): page table freeing -- a concurrent munmap() or THP
> collapse of an adjacent region -- serializes against lockless walkers via
> tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
> interrupts, the same contract gup_fast relies on.  The copy then runs with
> interrupts on, holding only the folio reference.

As I said in reply to 2/3 I don't think you need to do this.

Let's have a discussion _in the review_ about it :)

>
> A request that spans more than one VMA is left entirely to the mmap_lock
> path: relocking per VMA could observe a structurally inconsistent address
> space (a neighbouring VMA unmapped and a different one mapped in its place
> between locks), whereas the mmap_lock path sees a stable VMA tree for the
> whole transfer.
>
> The per-VMA permission check mirrors the read side of check_vma_flags(),
> including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
> on (CVE-2018-1120).  Anything not positively allowed -- a not-present
> page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
> VMA writer -- falls back to the mmap_lock path for the remainder, which
> re-validates everything.  Pages read on the fast path are marked accessed,
> matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
> path.
>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  14 ++-
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 195 +++++++++++++++++++++++++++++-
>  3 files changed, 217 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..933b0b8b4d60 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
>  	(__force __typeof__(addr))__untagged_addr(__addr);		\
>  })
>
> +/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE(mm->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  						   unsigned long addr)
>  {
>  	mmap_assert_locked(mm);
> -	return addr & READ_ONCE((mm)->context.untag_mask);
> +	return __untagged_addr_remote_unlocked(mm, addr);
>  }
>
>  #define untagged_addr_remote(mm, addr)	({				\
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\
> +	untagged_addr(addr);				\
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..d2b2f0014a0c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
> +/*
> + * Read-side VMA checks for the lockless fast path, mirroring the read side of
> + * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
> + * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
> + * copy (secretmem); enforce the FOLL_ANON restriction that
> + * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
> + * (honoring FOLL_FORCE).  Anything not positively allowed falls back to the slow
> + * path, which re-validates everything.
> + */

No wall of text please :)

Same comments for the rest, please use paragraphs, try to be succinct, etc.

> +static bool vma_permits_fast_access(struct vm_area_struct *vma,
> +				    unsigned int gup_flags)
> +{
> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +		return false;
> +	if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
> +		return false;
> +	if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
> +		return false;
> +	if (!(vma->vm_flags & VM_READ) &&
> +	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
> +		return false;
> +	return true;
> +}
> +
> +/* Size of the single mapping entry folio_walk_start() landed on. */
> +static unsigned long fw_entry_size(enum folio_walk_level level)
> +{
> +	switch (level) {
> +	case FW_LEVEL_PUD:
> +		return PUD_SIZE;
> +	case FW_LEVEL_PMD:
> +		return PMD_SIZE;
> +	default:
> +		return PAGE_SIZE;
> +	}
> +}
> +
> +/*
> + * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
> + * @folio_off within the folio (the position of @addr).  Maps and copies one
> + * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
> + * the per-page flush on aliasing caches -- without re-walking page tables.
> + * Each page borrows the caller's single folio reference, so the mapping is
> + * dropped with kunmap_local() rather than folio_release_kmap().
> + */
> +static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
> +			     unsigned long folio_off, unsigned long addr,
> +			     void *buf, unsigned long len)
> +{
> +	unsigned long done = 0;
> +
> +	while (done < len) {
> +		unsigned long pos = folio_off + done;
> +		unsigned long page_idx = pos >> PAGE_SHIFT;
> +		unsigned int page_off = pos & ~PAGE_MASK;
> +		unsigned int chunk = min_t(unsigned long, len - done,
> +					   PAGE_SIZE - page_off);
> +		void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
> +
> +		copy_from_user_page(vma, folio_page(folio, page_idx),
> +				    addr + done, buf + done, kaddr + page_off,
> +				    chunk);
> +		kunmap_local(kaddr);
> +		done += chunk;
> +	}
> +}
> +
> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the frequently
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
> + * table walk run with interrupts disabled, which serializes against concurrent
> + * page table freeing the same way gup_fast does (relying on
> + * MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Only a request that lies entirely within a single VMA is handled here,
> + * which should not be an issue in practice since every caller has a
> + * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
> + * should be rare, too.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	void *old_buf = buf;
> +	struct vm_area_struct *vma;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	vma = lock_vma_under_rcu(mm, addr);
> +	if (!vma)
> +		return 0;
> +
> +	/* Only handle a request contained entirely within this one VMA. */
> +	if (len > vma->vm_end - addr)
> +		goto out_unlock;
> +
> +	if (!vma_permits_fast_access(vma, gup_flags))
> +		goto out_unlock;
> +
> +	while (len) {
> +		struct folio_walk fw;
> +		struct folio *folio;
> +		struct page *page;
> +		unsigned long entry_size, folio_off, span, irq_flags;
> +
> +		/*
> +		 * The lockless page table walk must run with interrupts
> +		 * disabled: page table freeing (munmap or THP collapse, which
> +		 * IPI via tlb_remove_table_sync_one() and wait) then cannot free
> +		 * a table mid-walk -- the same contract gup_fast relies on.  IRQs
> +		 * are restored once the folio is pinned; the copy below holds only
> +		 * the folio reference.
> +		 */
> +		local_irq_save(irq_flags);
> +		folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +		if (!folio) {
> +			local_irq_restore(irq_flags);
> +			goto out_unlock;	/* not present: let the slow path fault it in */
> +		}
> +		page = fw.page;
> +		if (!page) {
> +			/* No struct page to copy (e.g. a special PTE). */
> +			folio_walk_end(&fw, vma);
> +			local_irq_restore(irq_flags);
> +			goto out_unlock;
> +		}
> +		entry_size = fw_entry_size(fw.level);
> +		folio_get(folio);
> +		folio_walk_end(&fw, vma);
> +		local_irq_restore(irq_flags);
> +
> +		/*
> +		 * folio_walk_start() validated one present mapping entry
> +		 * (PAGE/PMD/PUD_SIZE).  Copy to the end of that entry, bounded by
> +		 * the folio and the remaining length (already within the VMA), so
> +		 * a huge mapping is handled in a single walk.
> +		 */
> +		folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
> +			    offset_in_page(addr);
> +		span = min3((unsigned long)len,
> +			    entry_size - (addr & (entry_size - 1)),
> +			    (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
> +
> +		copy_folio_pages(vma, folio, folio_off, addr, buf, span);
> +
> +		/* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
> +		folio_mark_accessed(folio);
> +		folio_put(folio);
> +		len -= span;
> +		buf += span;
> +		addr += span;
> +	}
> +
> +out_unlock:
> +	vma_end_read(vma);
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */
> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;
> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;
>
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
>
>  	/* Avoid triggering the temporary warning in __get_user_pages */
>  	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
> -		return 0;
> +		return buf - old_buf;
>
>  	/* ignore errors, just check how much was successfully transferred */
>  	while (len) {
> --
> 2.53.0-Meta
>

Thanks, Lorenzo


^ permalink raw reply

* Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
From: Qi Zheng @ 2026-06-25  7:37 UTC (permalink / raw)
  To: Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable
In-Reply-To: <1d78e1c1-0cdb-435e-b278-670bce9148b3@kernel.org>



On 6/25/26 2:32 PM, Harry Yoo wrote:
> 
> 
> On 6/25/26 3:11 PM, Qi Zheng wrote:
>> On 6/25/26 12:16 PM, Harry Yoo wrote:
>>>
>> [...]
>>
>>>
>>>> So lock_batch_lruvec() can be implemented like this:
>>>>
>>>> #ifdef CONFIG_MEMCG
>>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>> {
>>>>       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>
>>>>       rcu_read_lock();
>>>>
>>>>       /*
>>>>        * The memcg can be NULL when the memory controller is disabled.
>>>>        * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>        */
>>>>       if (!memcg || !css_is_dying(&memcg->css))
>>>>           goto lock;
>>>>
>>>>       do {
>>>>           memcg = parent_mem_cgroup(memcg);
>>>>       } while (memcg && css_is_dying(&memcg->css));
>>>>       lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> lock:
>>>>       spin_lock_irq(&lruvec->lru_lock);
>>>>
>>>>       return lruvec;
>>>> }
>>>> #else
>>>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>> {
>>>>       lruvec_lock_irq(lruvec);
>>>>
>>>>       return lruvec;
>>>> }
>>>> #endif
>>>>
>>>> Does this make sense?
>>>
>>> Yes, looks good to me!
>>
>> OK, this sync method makes more sense as it doesn't require adding a
>> new lrugen->reparente. I'll go with this method and update v3.
> 
> Thanks!
> 
> Just one thing to clarify...
> 
> So, when we check something that's updated _before_ the grace period
> (CSS_DYING), RCU is sufficient.
> 
> But in folio_lruvec_lock*(), that is not the case because reparenting
> is performed in the RCU work, under the lruvec lock. So the check needs
> to be done under RCU and the lruvec lock.
> 
> This is quite subtle :D

Indeed.

And in theory, the l->nr_items check in lock_list_lru_of_memcg() could
also be replaced by the CSS_DYING check.

> 
>> Hi Barry and Baolin, what do you think? Since the sync method has been
>> changed, I will temporarily drop your previous Reviewed-by tags in v3. ;)
> 
> And hopefully Peiyang would kindly double-check that the issue still does
> not reproduce on the machine with v3 :)

Yeah!

> 




* Re: [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
From: Lorenzo Stoakes @ 2026-06-25  7:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team
In-Reply-To: <20260625015053.2445008-3-riel@surriel.com>

Rik, it really would have helped if you'd replied to the review :)

On Wed, Jun 24, 2026 at 09:50:52PM -0400, Rik van Riel wrote:
> folio_walk_start() asserts the mmap lock is held.  For callers that only
> need to read a single, already-present page, the mmap lock is a heavy and
> often badly contended hammer.  Such a caller can instead hold the per-VMA
> lock, which keeps the VMA itself stable.

<newline>

> The per-VMA lock does not, however, keep the page tables walked below that
> VMA from being freed.  A concurrent munmap() or THP collapse of an
> adjacent region in the same mm can free a shared upper-level table, and

Yeah, I need to update the documentation on this at
https://docs.kernel.org/mm/process_addrs.html; it's more subtle than written
there.

Firstly you're wrong about munmap() - it acquires the VMA lock of the VMAs freed
in the range and will only remove an upper level table if the entire range is
spanned.

And that's the only way higher level tables can be removed.

PTE page tables can be removed via MADV_DONTNEED, but that a. acquires the VMA
lock and b. frees the PTE page table under RCU.

A THP collapse can happen concurrently, but PTEs are freed under RCU so you
don't need to do this GUP fast imitating stuff.

> THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
> tables of VMAs whose lock it does not hold.  Page table freeing

retract_page_tables() -> pte_free_defer() -> RCU
try_collapse_pte_mapped_thp() -> pte_free_defer() -> RCU

> synchronizes against lockless walkers the way gup_fast relies on:
> tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
> interrupts, so a walker that keeps interrupts disabled across the walk
> cannot be observing a table that is about to be freed.  rcu_read_lock() is
> not sufficient -- it does not block that IPI -- so the caller must keep

Yes it is?

I mean unless I'm missing something here.

> interrupts disabled, not merely hold an RCU read-side critical section.
>
> Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
> per-VMA lock and that interrupts are disabled, instead of asserting the
> mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
> hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
> cover).  The caller must keep interrupts disabled until folio_walk_end().
>
> No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/pagewalk.h |  7 +++++++
>  mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
>  2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index b41d7265c01b..d0387470d732 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
>
>  /* Walk shared zeropages (small + huge) as well. */
>  #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
> +/*
> + * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
> + * disabled across the walk (until folio_walk_end()) to serialize against page
> + * table freeing, the same way gup_fast does. Only valid with RCU-freed page
> + * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
> + */
> +#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
>
>  enum folio_walk_level {
>  	FW_LEVEL_PTE,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..ab1e81983cb8 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>   * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>   * not correspond to the first physical entry of a logical hugetlb entry.
>   *
> - * The mmap lock must be held in read mode.
> + * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
> + * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
> + * across the walk and until folio_walk_end() (only supported with RCU-freed page
> + * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
>   *
>   * Return: folio pointer on success, otherwise NULL.
>   */
> @@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  	pgd_t *pgdp;
>  	p4d_t *p4dp;
>
> -	mmap_assert_locked(vma->vm_mm);
> +	if (flags & FW_VMA_LOCKED) {
> +		/*
> +		 * Lockless walk under the per-VMA lock instead of the mmap
> +		 * lock. The VMA lock keeps the VMA stable, but the page tables
> +		 * walked below it can still be freed concurrently: a munmap() or
> +		 * THP collapse of an adjacent region in the same mm can free a
> +		 * shared upper-level table, and collapse_huge_page() ->
> +		 * retract_page_tables() frees page tables of VMAs whose lock it
> +		 * does not hold. Page table freeing serializes against lockless
> +		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
> +		 * for every CPU to enable interrupts; an RCU read-side critical
> +		 * section does not block that IPI, so the caller must keep
> +		 * interrupts disabled across the whole walk, like gup_fast.
> +		 * Hugetlb (PMD sharing) maps page tables not covered by this
> +		 * VMA's lock and is not supported.
> +		 */

This is an unreadable wall of text, if it's AI generated please edit before
sending.

> +		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
> +		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
> +		lockdep_assert_irqs_disabled();
> +		vma_assert_locked(vma);
> +	} else {
> +		mmap_assert_locked(vma->vm_mm);
> +	}
>  	vma_pgtable_walk_begin(vma);
>
>  	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
> --
> 2.53.0-Meta
>

Thanks, Lorenzo



* Re: [PATCH] arch_numa: remove redundant nodemask clears in numa_init()
From: Sang-Heon Jeon @ 2026-06-25  7:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: rppt, Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	linux-mm, driver-core, linux-kernel
In-Reply-To: <20260624204030.3c8baa67713b6ca1d537baba@linux-foundation.org>

On Thu, Jun 25, 2026 at 12:40 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 18 Jun 2026 01:39:19 +0900 Sang-Heon Jeon <ekffu200098@gmail.com> wrote:
>
> > numa_init() clears numa_nodes_parsed, node_possible_map and
> > node_online_map, then calls numa_memblks_init(), which clears the same
> > nodemasks. Nothing uses them in between.
> >
> > These clears have been redundant since commit 767507654c22 ("arch_numa:
> > switch over to numa_memblks") made numa_init() use numa_memblks_init().
> >
> > No functional change.
> >
> > ...
> >
> > --- a/drivers/base/arch_numa.c
> > +++ b/drivers/base/arch_numa.c
> > @@ -231,10 +231,6 @@ static int __init numa_init(int (*init_func)(void))
> >  {
> >       int ret;
> >
> > -     nodes_clear(numa_nodes_parsed);
> > -     nodes_clear(node_possible_map);
> > -     nodes_clear(node_online_map);
> > -
> >       ret = numa_memblks_init(init_func, /* memblock_force_top_down */ false);
> >       if (ret < 0)
> >               goto out_free_distance;
>
> hm, OK, thanks.
>
>
> A couple of driveby questions:
>
> Are the other nodes_clear() calls needed - aren't these things
> zeroed when the kernel is loaded?
>

You're talking about the nodes_clear() calls in numa_memblks_init(), right? If
so, I think they're needed, because numa_memblks_init() can run more than once
during boot, and a previous call can leave the maps dirty.

As a more detailed example: if we first try to set up NUMA from DT,
numa_memblks_init(of_numa_init) is called.
of_numa_init() parses the CPU and memory nodes in order, so if parsing
the CPU nodes succeeds but parsing the memory nodes fails, the CPU-node bits
are still set in numa_nodes_parsed.
We then fall back to calling numa_memblks_init(dummy_numa_init). Since
dummy_numa_init() only sets node 0, those stale bits could survive
unexpectedly without nodes_clear().
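
A minimal userspace sketch of the stale-bit scenario above, not kernel code.
A plain 64-bit mask stands in for numa_nodes_parsed, and failing_init(),
dummy_init(), memblks_init() and boot_with_fallback() are hypothetical names
that only model of_numa_init() failing after the CPU nodes were recorded,
followed by the dummy_numa_init() fallback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for a nodemask such as numa_nodes_parsed (one bit per node). */
static uint64_t parsed_nodes;

/* Hypothetical parser modelled on the DT case: CPU nodes parse, memory nodes fail. */
static int failing_init(void)
{
	parsed_nodes |= (1ULL << 1) | (1ULL << 2);	/* CPU nodes 1 and 2 recorded */
	return -1;					/* memory node parsing fails */
}

/* Fallback modelled on dummy_numa_init(): only node 0 is set. */
static int dummy_init(void)
{
	parsed_nodes |= 1ULL << 0;
	return 0;
}

/* Model of numa_memblks_init(): optionally clear the mask before parsing. */
static int memblks_init(int (*init_func)(void), bool clear_first)
{
	if (clear_first)
		parsed_nodes = 0;
	return init_func();
}

/* Run the failing parser, then the fallback, and return the final mask. */
static uint64_t boot_with_fallback(bool clear_first)
{
	parsed_nodes = 0;
	if (memblks_init(failing_init, clear_first) < 0) {
		int ret = memblks_init(dummy_init, clear_first);

		assert(ret == 0);	/* the fallback is expected to succeed */
	}
	return parsed_nodes;
}
```

In this model, boot_with_fallback(true) ends with only node 0 set, while
boot_with_fallback(false) also keeps the leftover bits for nodes 1 and 2.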

> Also,
>
> #define node_possible_map       node_states[N_POSSIBLE]
>
> ...
>
> nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>         [N_POSSIBLE] = NODE_MASK_ALL,
>
> why do we carefully initialize node_possible_map at compile-time then
> zero it within __init code?
>

I also can't find why we carefully initialize node_possible_map, at
least for NUMA.
I'm not sure it's safe to remove the initialization on UMA. I'll take a
closer look when I have time, and send a patch to remove it if it's
safe to do so.

Best Regards,
Sang-Heon Jeon


* Re: [PATCH 1/6] mm/page_owner: extract skip_buddy_pages() helper to unify buddy page skipping
From: Ye Liu @ 2026-06-25  7:31 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Andrew Morton
  Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <b2e7646d-346b-4eb8-aecc-d70fad3a28f7@kernel.org>

On 2026/6/25 15:13, Vlastimil Babka (SUSE) wrote:
> On 6/25/26 02:20, Andrew Morton wrote:
>> On Tue, 23 Jun 2026 14:52:26 +0800 Ye Liu <ye.liu@linux.dev> wrote:
>>
>>> Three places in page_owner.c duplicate the same pattern: check if a
>>> page is PageBuddy, read its order via buddy_order_unsafe(), advance
>>> the pfn past the buddy block if the order is valid, and continue.
>>>
>>> Consolidate them into a single inline helper skip_buddy_pages().
>>> The function returns true (skip) for any buddy page and advances
>>> @pfn past the block when the order is valid; returns false if the
>>> page is not a buddy page and should be processed normally.
>>>
>>> The old init_pages_in_zone() variant used "order > 0" as an extra
>>> guard before advancing pfn, but the continue was unconditional and
>>> (1UL << 0) - 1 == 0, so the behaviour is identical.  The comment
>>> about zone->lock is preserved in the helper's kernel-doc.
>>
>> All looks nice, thanks.
> 
> I got a bunch of "added to mm-hotfixes-unstable branch" mails, but this
> seems like cleanups and nothing urgent? Was that intended?
> 
>> A [0/N] cover letter is nice to have.
> 
> Seems like it exists, but wasn't delivered. Lore shows its message id, but
> as missing.

Apologies, I accidentally sent the cover letter only to my own address.
I'll ensure it goes to the mailing list in future submissions.         
Thanks for pointing it out.                                            

Subject: [PATCH 0/6] mm/page_owner: misc cleanups                      

Hi,                                                                    

This series collects a few cleanups for mm/page_owner.c that have been 
accumulated while reading through the file.  There is no functional    
change -- the goal is to make the code easier to read and maintain.    

Patch 1 consolidates three identical PageBuddy skip blocks into a      
single skip_buddy_pages() helper, eliminating the duplication and      
keeping the lockless-read comment in one place.                        

Patch 2 replaces the -1 magic number used for "never migrated" with    
a local MIGRATE_REASON_NONE define, making the intent explicit at      
every use site.                                                        

Patch 3 hoists the CONFIG_MEMCG guard out of print_page_owner_memcg()'s
body so that the real implementation and the empty stub are two clearly
separate definitions, the common kernel idiom.                         

Patch 4 adds a missing \n to the count_threshold debugfs attribute     
format string so that cat(1) output is properly terminated.            

Patch 5 moves free_ts_nsec from the allocation summary line to the     
free section in __dump_page_owner(), grouping it with free_pid and     
free_tgid where it logically belongs.  This also makes the dump        
output consistent with print_page_owner().                             

Patch 6 drops the redundant page_owner_ prefix from file-scoped static 
symbols (stack_fops, threshold_fops, etc.).  Since they cannot collide 
across translation units, the prefix carries no information.           

The series is based on v6.17-rc1 and has been compile-tested with and  
without CONFIG_MEMCG.                                                  
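
The "order > 0" equivalence claimed in the quoted commit message for patch 1
can be checked with a small userspace sketch. This is not the page_owner.c
code: SKETCH_MAX_ORDER is an assumed stand-in for MAX_PAGE_ORDER, and all
function names are made up for illustration.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 10	/* assumption: stand-in for MAX_PAGE_ORDER */

/* pfn advance in the unified form: skip the rest of the block if order is valid. */
static unsigned long advance_unified(unsigned long order)
{
	if (order <= SKETCH_MAX_ORDER)
		return (1UL << order) - 1;
	return 0;
}

/* pfn advance in the old form that carried the extra "order > 0" guard. */
static unsigned long advance_guarded(unsigned long order)
{
	if (order > 0 && order <= SKETCH_MAX_ORDER)
		return (1UL << order) - 1;
	return 0;
}

/* Assert that both forms advance the pfn identically for every order up to limit. */
static void check_forms_agree(unsigned long limit)
{
	for (unsigned long order = 0; order <= limit; order++)
		assert(advance_unified(order) == advance_guarded(order));
}
```

For order 0 both forms advance by zero, since (1UL << 0) - 1 == 0, so the
guard only ever skipped a no-op.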

Ye Liu (6):                                                            
  mm/page_owner: extract skip_buddy_pages() helper to unify buddy page 
    skipping                                                           
  mm/page_owner: use MIGRATE_REASON_NONE instead of -1 for             
    last_migrate_reason                                                
  mm/page_owner: hoist CONFIG_MEMCG to function level for              
    print_page_owner_memcg()                                           
  mm/page_owner: add missing newline to count_threshold format string  
  mm/page_owner: move free_ts_nsec output to free section in           
    __dump_page_owner()                                                
  mm/page_owner: drop redundant page_owner prefix from static symbols  

 mm/page_owner.c | 121 +++++++++++++++++++++++++++---------------------
 1 file changed, 67 insertions(+), 54 deletions(-)                     

> 
>> AI review identified a few possible pre-existing issues, if you're
>> interested:
>> 	https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev
>>
> 

-- 
Thanks,
Ye Liu


* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Christian Brauner @ 2026-06-25  7:28 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Lorenzo Stoakes, brauner,
	mjguzik, pfalcato, ebiederm, viro, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
> 
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
> 
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d450861e..bc6d3859f874 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>  
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.

Nah, we're not doing a separate file for this. That makes no sense
whatsoever. I've already explained this in the first mail. There are
effectively three modes:

(1) dump to a file
(2) spawn a super-privileged usermode helper process and connect the
    coredumping process to said helper via a pipe
(3) coredumping process connects to AF_UNIX socket

Parameterize (1) and (2) via command line arguments. I strongly
suspect you're using some AI tooling so it should be able to figure out
how this was done in the past.

(3) can be extended by just introducing a new flag value for struct
    coredump_req. That is also illustrated by previous work.

We're not spreading procfs files. It's terrible API design, especially
for security-sensitive changes.

> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);
> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>  
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b162d0..a58ffffcc31d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>  
>  #include "internal.h"
>  
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>  
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}
> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..99b5f219f7fa 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9dbd..dfd4717c7e3e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6a30..0555aaf50001 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>  
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..b4becbf6c0eb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>  
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..360604d653b4 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif

Sorry, not going to happen. This doesn't justify the addition of a
new uapi value at all.

I'm also including various Sashiko comments:

sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Is it safe to expose an internal, temporary kernel flag in a UAPI header?
Userspace applications could intentionally or accidentally pass O_TMPCLOS to
open(), which might permanently pollute the userspace ABI and trigger
unexpected behavior during a coredump.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448234..84f1ee7f32cf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>  
>  __setup("coredump_filter=", coredump_filter_setup);
>  
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);

This makes no sense. I think you really need to sit down and think about
a design for this that doesn't introduce state machinery for boot, mm,
and the VFS in one shot to solve a fringe problem...





sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
iteration invalidate the outer iterator? The loop traverses the maple tree
using the iterator vmi. However, do_munmap() creates its own internal
VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
iterator is not updated to reflect these structural changes, its cached
state becomes stale, which can lead to a use-after-free when vma_next()
is subsequently called.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Is it safe to iterate the file descriptor table without holding
rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
kills other threads, concurrent threads can still trigger expand_files(),
which replaces the fdt and frees the old one after an RCU grace period.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
of file->f_flags risks losing concurrent fcntl() updates since it doesn't
hold file->f_lock.

Also, if a file has duplicated file descriptors (e.g., via dup()), will
clearing O_TMPCLOS here prematurely skip the closure of the remaining
descriptors? When encountering the duplicated descriptor later, the flag
will already be cleared, leaving the shared file actively referenced.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com
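
The dup() concern above can be illustrated with a userspace model. This is a
sketch under assumptions, not the patch's code: struct sketch_file,
SKETCH_TMPCLOS and both functions are hypothetical stand-ins, and the walk
corresponds to the checkflock == false case, where a slot is closed only if
the flag is still set.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TMPCLOS	0x1u	/* hypothetical stand-in for the proposed O_TMPCLOS */
#define SKETCH_NFDS	4

struct sketch_file {
	unsigned int f_flags;
};

/*
 * Walk the table like the proposed loop: close a slot only if the flag is
 * still set, and clear the flag on the first hit. Returns the slots closed.
 */
static int close_marked_clear_on_first(struct sketch_file *fd[SKETCH_NFDS])
{
	int closed = 0;

	for (size_t i = 0; i < SKETCH_NFDS; i++) {
		struct sketch_file *file = fd[i];

		if (!file || !(file->f_flags & SKETCH_TMPCLOS))
			continue;
		file->f_flags &= ~SKETCH_TMPCLOS;	/* later aliases no longer match */
		fd[i] = NULL;
		closed++;
	}
	return closed;
}

/* Build a table where slots 0 and 2 alias one marked file, as after dup(). */
static int closed_slots_with_dup(void)
{
	struct sketch_file shared = { .f_flags = SKETCH_TMPCLOS };
	struct sketch_file *fd[SKETCH_NFDS] = { &shared, NULL, &shared, NULL };
	int closed = close_marked_clear_on_first(fd);

	assert(fd[0] == NULL);		/* first alias was closed */
	assert(fd[2] == &shared);	/* second alias survived the walk */
	return closed;
}
```

In this model only one of the two aliased slots is closed, so the shared
file stays referenced, which is the behaviour the comment asks about.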

-- 
Christian Brauner <brauner@kernel.org>



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Alexandre Ghiti @ 2026-06-25  7:26 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, hannes, yosry, nphamcs
  Cc: chengming.zhou, ljs, liam, vbabka, rppt, surenb, mhocko, kasong,
	chrisl, baohua, usama.arif, linux-mm, linux-kernel
In-Reply-To: <3523f142-778c-4efd-9675-3c68b33e7e3d@kernel.org>

Hi David,

On 6/24/26 16:58, David Hildenbrand (Arm) wrote:
> On 6/24/26 09:55, Alexandre Ghiti wrote:
>> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
>> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>>
>> zswap is the same kind of in-memory, synchronous backend as zram, but it is
>> not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
>> swapin_readahead().
>>
>> Here are the results from bypassing readahead for zswap too: it was
>> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
>> off, on Sapphire Rapids and 3 iterations.
>>
>>    768M memcg (sustained swap thrash):
>>      metric                 mm-new    + bypass    delta
>>      build time (s)          405.0       341.7    -15.6%
>>      zswap-in (GB)            79.5        53.0     -33%
>>      zswap-out (GB)          144.8       115.6     -20%
>>      swap readahead (pages)  6.79M       0.45M     -93%
>>      swap_ra hit (%)          72.1        89.9     +18pp
>>
>>    1G memcg (light pressure, build not memory-bound):
>>      metric                 mm-new    + bypass    delta
>>      build time (s)          177.7       176.0    ~same (no regression)
>>      zswap-in (GB)            10.2         7.5     -26%
>>      zswap-out (GB)           27.7        25.1      -9%
>>      swap readahead (pages)  1.07M       0.08M     -93%
>>      swap_ra hit (%)          68.6        87.2     +19pp
>>
>> The gain is from no longer prefetching pages that are pointless for an
>> in-memory backend: readahead inflates anon residency and thrashes the
>> page cache (file pages get evicted and re-read), lengthens each fault by
>> synchronously (de)compressing a cluster of neighbours, and adds
>> compression traffic when those extra pages are reclaimed.
>>
>> Bypassing swap readahead for zswap therefore makes sense.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
> [...]
>
>>   #endif /* _LINUX_ZSWAP_H */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ff338c2abe92..5aa1ea9eb48a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   	if (folio)
>>   		swap_update_readahead(folio, vma, vmf->address);
>>   	if (!folio) {
>> -		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
>> -		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
>> +		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
>> +		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
>> +		    zswap_present_test(entry))
> This should really be abstracted into a reasonably-named helper that can live in
> swap code.
>

Makes sense, I'll come up with something.

Thanks,

Alex


