Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC PATCH 2/4] dma/pool: Add an API to check if DMA allocation is from pool
From: Samiullah Khawaja @ 2026-06-24 19:10 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Marek Szyprowski, Will Deacon, Jason Gunthorpe, Pasha Tatashin,
	Mike Rapoport, Pratyush Yadav, Alexander Graf, Robin Murphy,
	Kevin Tian, iommu, kexec, linux-mm, linux-kernel, David Matlack,
	Andrew Morton, Vipin Sharma
In-Reply-To: <aicFes0uxUFQAduC@google.com>

On Mon, Jun 08, 2026 at 06:10:02PM +0000, Pranjal Shrivastava wrote:
>On Tue, May 05, 2026 at 12:27:35AM +0000, Samiullah Khawaja wrote:
>> DMA allocations can be done through DMA pools, add an API that can be
>> used to check if an allocation is done from a pool. This will be used in
>> the later commit during preservation of DMA allocation.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  include/linux/dma-map-ops.h |  1 +
>>  kernel/dma/pool.c           | 13 +++++++++++++
>>  2 files changed, 14 insertions(+)
>>
>> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
>> index 6a1832a73cad..6a0bc4ea2467 100644
>> --- a/include/linux/dma-map-ops.h
>> +++ b/include/linux/dma-map-ops.h
>> @@ -216,6 +216,7 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
>>  		bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
>>  bool dma_free_from_pool(struct device *dev, void *start, size_t size);
>>
>> +bool dma_is_from_pool(struct device *dev, void *start, size_t size);
>>  int dma_direct_set_offset(struct device *dev, phys_addr_t cpu_start,
>>  		dma_addr_t dma_start, u64 size);
>>
>> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
>> index 2b2fbb709242..32ce4d6d7683 100644
>> --- a/kernel/dma/pool.c
>> +++ b/kernel/dma/pool.c
>> @@ -307,3 +307,16 @@ bool dma_free_from_pool(struct device *dev, void *start, size_t size)
>>
>>  	return false;
>>  }
>> +
>> +bool dma_is_from_pool(struct device *dev, void *start, size_t size)
>
>Do we need struct device here? It seems unused?
>
>Nit: we only ever pass 0 gfp_flags to dma_guess_pool, should we instead
>name it: dma_is_from_atomic_pool() to be more accurate?

dma_guess_pool() goes through all of them when passing 0, look at the
prev logic inside it and how it is used by the dma_free_from_pool(). But
you are right they are all atomic, but semantically we want to check
whether it is part of any pool so lets keep it as is. WDYT?
>
>> +{
>> +	struct gen_pool *pool = NULL;
>> +
>> +	while ((pool = dma_guess_pool(pool, 0))) {
>> +		if (!gen_pool_has_addr(pool, (unsigned long)start, size))
>> +			continue;
>> +		return true;
>> +	}
>
>Nit: The loop looks slightly ugly, can we have:
>
>struct gen_pool *pool;
>
>for (pool = dma_guess_pool(NULL, 0); pool != NULL; pool = dma_guess_pool(pool, 0)) {
>	if (gen_pool_has_addr(pool, (unsigned long)start, size))
>		return true;
>}

It is a basically a copy dma_free_from_pool() without the free part, so
keeping the style consistent.
>
>> +
>> +	return false;
>> +}
>> --
>> 2.54.0.545.g6539524ca2-goog
>>
>
>Thanks,
>Praan

Thanks,
Sami


^ permalink raw reply

* Re: [PATCH v8 28/46] KVM: selftests: Add support for mmap() on guest_memfd in core library
From: Fuad Tabba @ 2026-06-24 19:07 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-28-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Accept gmem_flags in vm_mem_add() to be able to create a guest_memfd within
> vm_mem_add().
>
> When vm_mem_add() is used to set up a guest_memfd for a memslot, set up the
> provided (or created) gmem_fd as the fd for the user memory region. This
> makes it available to be mmap()-ed from just like fds from other memory
> sources. mmap() from guest_memfd using the provided gmem_flags and
> gmem_offset.
>
> Add a kvm_slot_to_fd() helper to provide convenient access to the file
> descriptor of a memslot.
>
> Update existing callers of vm_mem_add() to pass 0 for gmem_flags to
> preserve existing behavior.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [For guest_memfds, mmap() using gmem_offset instead of 0 all the time.]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  tools/testing/selftests/kvm/include/kvm_util.h     |  7 +++++-
>  tools/testing/selftests/kvm/lib/kvm_util.c         | 27 ++++++++++++----------
>  .../kvm/x86/private_mem_conversions_test.c         |  2 +-
>  3 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index d4c104cb0418f..0cacf3698b259 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -700,7 +700,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
>                                  gpa_t gpa, u32 slot, u64 npages, u32 flags);
>  void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>                 gpa_t gpa, u32 slot, u64 npages, u32 flags,
> -               int gmem_fd, u64 gmem_offset);
> +               int gmem_fd, u64 gmem_offset, u64 gmem_flags);
>
>  #ifndef vm_arch_has_protected_memory
>  static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
> @@ -732,6 +732,11 @@ void *addr_gva2hva(struct kvm_vm *vm, gva_t gva);
>  gpa_t addr_hva2gpa(struct kvm_vm *vm, void *hva);
>  void *addr_gpa2alias(struct kvm_vm *vm, gpa_t gpa);
>
> +static inline int kvm_slot_to_fd(struct kvm_vm *vm, u32 slot)
> +{
> +       return memslot2region(vm, slot)->fd;
> +}
> +
>  #ifndef vcpu_arch_put_guest
>  #define vcpu_arch_put_guest(mem, val) do { (mem) = (val); } while (0)
>  #endif
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9b482778f7379..d5bbc80b2bf1c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -978,12 +978,13 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, u32 slot, u32 flags,
>  /* FIXME: This thing needs to be ripped apart and rewritten. */
>  void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>                 gpa_t gpa, u32 slot, u64 npages, u32 flags,
> -               int gmem_fd, u64 gmem_offset)
> +               int gmem_fd, u64 gmem_offset, u64 gmem_flags)
>  {
>         int ret;
>         struct userspace_mem_region *region;
>         size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
>         size_t mem_size = npages * vm->page_size;
> +       off_t mmap_offset = 0;
>         size_t alignment = 1;
>
>         TEST_REQUIRE_SET_USER_MEMORY_REGION2();
> @@ -1055,8 +1056,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>
>         if (flags & KVM_MEM_GUEST_MEMFD) {
>                 if (gmem_fd < 0) {
> -                       u32 gmem_flags = 0;
> -
>                         TEST_ASSERT(!gmem_offset,
>                                     "Offset must be zero when creating new guest_memfd");
>                         gmem_fd = vm_create_guest_memfd(vm, mem_size, gmem_flags);
> @@ -1077,13 +1076,17 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>         }
>
>         region->fd = -1;
> -       if (backing_src_is_shared(src_type))
> +       if (flags & KVM_MEM_GUEST_MEMFD && gmem_flags & GUEST_MEMFD_FLAG_MMAP) {
> +               region->fd = kvm_dup(gmem_fd);
> +               mmap_offset = gmem_offset;
> +       } else if (backing_src_is_shared(src_type)) {
>                 region->fd = kvm_memfd_alloc(region->mmap_size,
>                                              src_type == VM_MEM_SRC_SHARED_HUGETLB);
> +       }
>
> -       region->mmap_start = kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> -                                     vm_mem_backing_src_alias(src_type)->flag,
> -                                     region->fd);
> +       region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> +                                       vm_mem_backing_src_alias(src_type)->flag,
> +                                       region->fd, mmap_offset);
>
>         TEST_ASSERT(!is_backing_src_hugetlb(src_type) ||
>                     region->mmap_start == align_ptr_up(region->mmap_start, backing_src_pagesz),
> @@ -1129,10 +1132,10 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>
>         /* If shared memory, create an alias. */
>         if (region->fd >= 0) {
> -               region->mmap_alias = kvm_mmap(region->mmap_size,
> -                                             PROT_READ | PROT_WRITE,
> -                                             vm_mem_backing_src_alias(src_type)->flag,
> -                                             region->fd);
> +               region->mmap_alias = __kvm_mmap(region->mmap_size,
> +                                               PROT_READ | PROT_WRITE,
> +                                               vm_mem_backing_src_alias(src_type)->flag,
> +                                               region->fd, mmap_offset);
>
>                 /* Align host alias address */
>                 region->host_alias = align_ptr_up(region->mmap_alias, alignment);
> @@ -1143,7 +1146,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
>                                  enum vm_mem_backing_src_type src_type,
>                                  gpa_t gpa, u32 slot, u64 npages, u32 flags)
>  {
> -       vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0);
> +       vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0, 0);
>  }
>
>  /*
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> index 1d2f5d4fd45d7..861baff201e78 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> @@ -399,7 +399,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
>         for (i = 0; i < nr_memslots; i++)
>                 vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
>                            BASE_DATA_SLOT + i, slot_size / vm->page_size,
> -                          KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
> +                          KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
>
>         for (i = 0; i < nr_vcpus; i++) {
>                 gpa_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>


^ permalink raw reply

* Re: [PATCH 2/3] ovl: support cachestat() syscall on overlayfs files
From: Nhat Pham @ 2026-06-24 19:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Pavel Tikhomirov, Miklos Szeredi, Alexander Viro,
	Christian Brauner, Jan Kara, Matthew Wilcox (Oracle),
	Andrew Morton, Johannes Weiner, Shuah Khan, linux-unionfs,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <CAOQ4uxhPROaeEyYceKNsL5QVwVPkTjGtbRr1P_TYXTA0fHvRoA@mail.gmail.com>

On Wed, Jun 24, 2026 at 7:16 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jun 24, 2026 at 1:45 PM Pavel Tikhomirov
> <ptikhomirov@virtuozzo.com> wrote:
> >
> >
> >
> > On 6/23/26 19:12, Nhat Pham wrote:
> > > On Tue, Jun 23, 2026 at 4:15 AM Pavel Tikhomirov
> > > <ptikhomirov@virtuozzo.com> wrote:
> > >>
> > >> Overlayfs forwards data I/O to the real (upper/lower) file, so the page
> > >> cache lives in the real inode's mapping and cachestat() on an overlay
> > >> fd returned all zeroes.
> > >>
> > >> Implement the ->cachestat() file operation by forwarding to the real
> > >> file via vfs_cachestat(), the same way ovl_fadvise() forwards
> > >> for fadvise.
> > >>
> > >> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > >> ---
> > >>  fs/overlayfs/file.c | 18 ++++++++++++++++++
> > >>  1 file changed, 18 insertions(+)
> > >>
> > >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > >> index 27cc07738f33b..a7e252a91ea43 100644
> > >> --- a/fs/overlayfs/file.c
> > >> +++ b/fs/overlayfs/file.c
> > >> @@ -518,6 +518,21 @@ static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> > >>                 return vfs_fadvise(realfile, offset, len, advice);
> > >>  }
> > >>
> > >> +#ifdef CONFIG_CACHESTAT_SYSCALL
> > >> +static int ovl_cachestat(struct file *file, struct cachestat_range *csr,
> > >> +                        struct cachestat *cs)
> > >> +{
> > >> +       struct file *realfile;
> > >> +
> > >> +       realfile = ovl_real_file(file);
> > >> +       if (IS_ERR(realfile))
> > >> +               return PTR_ERR(realfile);
> > >
> > > We're propagating the error of ovl_real_file() all the way to
> > > userspace right? I think we need to handle this.
> > >
> > > For example, we might get -EIO here, which is unexpected and
> > > undocumented from cachestat's POV.
> > >
> > > Maybe handle it and just return -EBADF or sth like that (with some
> > > updated documentations, etc.)
> > >
> > > The rest LGTM, but I'll let overlayfs maintainers check the
> > > overlayfs-specific bits :)
> >
> > Yeh, we probably can use EBADF here instead of propagating:
> >
> > Man cachestat(2) says:
> >
> >   EBADF  Invalid file descriptor.
> >
> > not really a bad fd here, but probably close enough not to rewrite man.
>
> Please don't do that.
>
> Re-read what you just wrote - it is ridiculous
> Because of being lazy to update man page,
> we are going to send a confusing error to user which tells them
> that their fd is wrong, which it is not.

I don't think we're being lazy here. It's technically more work to
handle errors and updating documentations :)

I'm more concerned with undocumented/unexpected behavior (error type
in this case). -EIO was an example that I saw in ovl_real_file()
itself, but I'm not familiar enough with overlayfs to know if that's
the extent of it.

But I'm OK with just updating the documentation with a simple note
that other error maybe propagated from the underlying fs, if no one
else thinks it's a problem :)

>
> >
> > I'm a bit hesitant though, since in other overlayfs operations we already
> > propagate, maybe that was by design?
> >
>
> Exactly, plenty of overlayfs operations return EIO for unexpected
> conditions, often accompanied with some assertion as is the case
> with ovl_real_file().
>
> Even though many man pages don't document an explicit EIO error
> code, it is obvious to any experienced sys admin that if EIO is observed
> they should look at the kernel logs, because an underlying subsystem
> may have reported critical errors.
>
> But in general, man pages follow development, not the other way around.

Fair point.

>
> Thanks,
> Amir.


^ permalink raw reply

* Re: [RFC PATCH 3/4] dma-direct: Add API to preserve/restore allocations
From: Samiullah Khawaja @ 2026-06-24 19:00 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Marek Szyprowski, Will Deacon, Jason Gunthorpe, Pasha Tatashin,
	Mike Rapoport, Pratyush Yadav, Alexander Graf, Robin Murphy,
	Kevin Tian, iommu, kexec, linux-mm, linux-kernel, David Matlack,
	Andrew Morton, Vipin Sharma
In-Reply-To: <aiceKYmc323G7tBs@google.com>

On Mon, Jun 08, 2026 at 07:55:21PM +0000, Pranjal Shrivastava wrote:
>On Tue, May 05, 2026 at 12:27:36AM +0000, Samiullah Khawaja wrote:
>> Add an API to preserve/restore the DMA direct allocation for liveupdate.
>> The underlying memory is preserved/restored using KHO. During restore
>> the memory is setup based on the device configuration, gfp flags and
>> allocation attributes. Once restored, the driver can use the usual
>> dma_free* API to deallocate the restored DMA allocation.
>>
>> This API will be used to add support in dma_alloc* APIs to
>> preseve/restore the DMA allocations.
>>
>> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> ---
>>  include/linux/dma-direct.h |  29 +++++++
>>  kernel/dma/Kconfig         |   3 +
>>  kernel/dma/direct.c        | 163 +++++++++++++++++++++++++++++++++++++
>>  3 files changed, 195 insertions(+)
>>
>
>[...]
>
>> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
>> index bfef21b4a9ae..d92852942c6c 100644
>> --- a/kernel/dma/Kconfig
>> +++ b/kernel/dma/Kconfig
>> @@ -265,3 +265,6 @@ config DMA_MAP_BENCHMARK
>>  	  performance of dma_(un)map_page.
>>
>>  	  See tools/testing/selftests/dma/dma_map_benchmark.c
>> +
>> +config DMA_LIVEUPDATE
>> +	bool "Enable preservation of DMA direct allocations"
>
>Nit: depends on LIVEUPDATE?

Agreed I will add this.
>
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index ec887f443741..c2b98f91900a 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -6,6 +6,8 @@
>>   */
>>  #include <linux/memblock.h> /* for max_pfn */
>>  #include <linux/export.h>
>> +#include <linux/kexec_handover.h>
>> +#include <linux/kho/abi/dma_alloc.h>
>>  #include <linux/mm.h>
>>  #include <linux/dma-map-ops.h>
>>  #include <linux/scatterlist.h>
>> @@ -307,6 +309,167 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  	return NULL;
>>  }
>>
>> +#ifdef CONFIG_DMA_LIVEUPDATE
>> +int dma_direct_preserve_allocation(struct device *dev, void *cpu_addr,
>> +				   size_t size, dma_addr_t dma_handle,
>> +				   unsigned long attrs, u64 *state)
>> +{
>> +	struct dma_alloc_ser *ser;
>> +	int ret;
>> +
>> +	if (!kho_is_enabled())
>> +		return -EOPNOTSUPP;
>> +
>> +	if (IS_ENABLED(CONFIG_DMA_CMA))
>> +		return -EOPNOTSUPP;
>> +
>> +	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
>> +	    !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_ALLOC) &&
>> +	    !dev_is_dma_coherent(dev) &&
>> +	    !is_swiotlb_for_alloc(dev))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (IS_ENABLED(CONFIG_DMA_GLOBAL_POOL) &&
>> +	    !dev_is_dma_coherent(dev))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>> +	    dma_is_from_pool(dev, cpu_addr, PAGE_ALIGN(size)))
>> +		return -EOPNOTSUPP;
>> +
>> +	ser = kho_alloc_preserve(sizeof(*ser));
>> +	if (IS_ERR(ser))
>> +		return PTR_ERR(ser);
>> +
>> +	ser->page_phys = dma_to_phys(dev, dma_handle);
>> +	ser->force_decrypted = force_dma_unencrypted(dev);
>> +	ser->size = size;
>> +
>> +	ret = kho_preserve_pages(phys_to_page(ser->page_phys),
>> +				 size >> PAGE_SHIFT);
>
>Should this be `PAGE_ALIGN(size) >> PAGE_SHIFT` OR
>`DIV_ROUND_UP(size, PAGE_SIZE)`?
>
>Otherwise, if size is small, say, size == 64-bytes, we preserve 0 pages?
>
>Also, IIRC, even with PAGE_ALIGN, preserving just the requested pgcount
>is not enough because buddy allocator allocates in order-N.
>
>For e.g. if a driver requests 20KB (5 pages), the buddy allocator
>fulfills it with an order-3 block (8 pages).
>
>Now, if we only tell KHO to preserve 5 pages, the remaining 3 pages are
>free in the new kernel. When the driver eventually tears down and calls
>dma_free_coherent(), dma_direct_free() will call
>__free_pages(page, get_order(size)), which will attempt to free all 8
>pages, causing a double-free panic on the 3 unpreserved pages?
>
>Should we be preserving exactly 1 << get_order(size) pages as per buddy?
>Same applies to unpreserve, and restore.

Agreed. I will update this and also make sure it is covered in the kunit
tests.
>
>> +	if (ret) {
>> +		kho_unpreserve_free(ser);
>> +		return ret;
>> +	}
>> +
>> +	*state = virt_to_phys(ser);
>> +	return 0;
>> +}
>> +
>> +void dma_direct_unpreserve_allocation(struct device *dev, u64 state)
>> +{
>> +	struct dma_alloc_ser *ser;
>> +
>> +	if (!kho_is_enabled())
>> +		return;
>> +
>> +	ser = phys_to_virt(state);
>> +	kho_unpreserve_pages(phys_to_page(ser->page_phys),
>> +			     ser->size >> PAGE_SHIFT);
>> +	kho_unpreserve_free(ser);
>> +}
>> +
>> +void *dma_direct_restore_allocation(struct device *dev, size_t size,
>> +				    dma_addr_t *dma_handle, gfp_t gfp,
>> +				    unsigned long attrs, u64 state)
>
>Are we relying on the caller to pass same attrs? So, a buffer with
>non-coherent attrs can be mapped with coherent attrs in the new kernel.
>Could this cause side-effects? Should we check for such driver bugs with
>a WARN here while comparing older attrs with the newer ones too?
>
>Coherency breaking due to subtle driver bugs is very painful to debug :/

Hmm... this is interesting. The dma_alloc API relies on the caller to
have consistent attrs accross allocation and free. But when updating
kernel where driver could have been updated, we have to be careful.

Agreed.. I will handle this properly by making sure that the new attr is
compatible with the preserved attr.
>
>> +{
>> +	bool remap = false, set_uncached = false;
>> +	struct dma_alloc_ser *ser = NULL;
>> +	struct page *page;
>> +	void *cpu_addr;
>> +
>> +	if (!kho_is_enabled())
>> +		return NULL;
>> +
>> +	ser = phys_to_virt(state);
>> +	page = phys_to_page(ser->page_phys);
>
>[...]
>
>> +
>> +	/*
>> +	 * Remapping will be blocking so return error. The preserved memory
>> +	 * might be already decrypted in the previous kernel, but the decryption
>> +	 * call is not guaranteed to be non-blocking so return error always if
>> +	 * decryption is required.
>> +	 */
>> +	if ((remap || force_dma_unencrypted(dev)) &&
>> +	    dma_direct_use_pool(dev, gfp))
>> +		return NULL;
>> +
>> +	/*
>> +	 * Encryption scheme changed between two kernels and this might cause
>> +	 * issues if device/driver is not handling it properly.
>> +	 */
>> +	WARN_ON_ONCE(ser->force_decrypted != force_dma_unencrypted(dev));
>> +
>> +	/*
>> +	 * arch_dma_prep_coherent() should make sure that any cache lines from
>> +	 * the previous kernel, if the device was coherent previously or cached
>> +	 * mapping in this kernel during init are not problamatic for
>> +	 * non-coherent allocations.
>> +	 */
>> +	if (remap) {
>> +		pgprot_t prot = dma_pgprot(dev, PAGE_KERNEL, attrs);
>> +
>> +		if (force_dma_unencrypted(dev))
>> +			prot = pgprot_decrypted(prot);
>> +
>> +		arch_dma_prep_coherent(page, size);
>> +
>> +		cpu_addr = dma_common_contiguous_remap(page, size, prot,
>> +						       __builtin_return_address(0));
>> +		if (!cpu_addr)
>> +			return NULL;
>
>Should we be kho_restore_free-ing on all these error paths?
>We only seem to be kho_restore_free-ing on the success path.
>Same for kho_restore_pages.. if we return an error here, we don't
>restore the preserved pages? Are we leaking those too?

This is purposefully leaking the memory here. This is because during
liveupdate, this device could be using this memory and freeing it means
that this might cause a memory corruption that would pretty difficult to
debug.
>
>> +	} else {
>> +		cpu_addr = page_address(page);
>> +		if (dma_set_decrypted(dev, cpu_addr, size))
>> +			return NULL;
>> +	}
>> +
>> +	if (set_uncached) {
>> +		arch_dma_prep_coherent(page, size);
>> +		cpu_addr = arch_dma_set_uncached(cpu_addr, size);
>> +		if (IS_ERR(cpu_addr))
>> +			return NULL;
>> +	}
>> +
>> +	*dma_handle = phys_to_dma_direct(dev, ser->page_phys);
>> +
>> +	/*
>> +	 * Cannot free the restored pages on error here as these might be in use
>> +	 * by a device with direct allocation in the previous kernel.
>> +	 */


Check this comment that explains the logic behind not freeing. I think I
will move it up.
>> +	WARN_ON(!kho_restore_pages(ser->page_phys,
>> +				   ser->size >> PAGE_SHIFT));
>> +	kho_restore_free(ser);
>> +	return cpu_addr;
>> +}
>> +#endif
>> +
>>  void dma_direct_free(struct device *dev, size_t size,
>>  		void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
>>  {
>
>Thanks,
>Praan

Thanks,
Sami


^ permalink raw reply

* Re: [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs
From: Gregory Price @ 2026-06-24 18:59 UTC (permalink / raw)
  To: linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-1-gourry@gourry.net>

On Wed, Jun 24, 2026 at 10:57:35AM -0400, Gregory Price wrote:
>... snip ...

Disregard, there are a few unaddressed Sashiko comments, I'm just going
to respin this.  Will wait until after the merge window closes for v6.

The rough shape of things should still hold w/ prior feedback.

~Gregory


^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Fuad Tabba @ 2026-06-24 18:57 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-24-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Make in-place conversion the default if the arch has private mem.
>
> The default can be overridden at compile type by enabling

compile _time_

> CONFIG_KVM_VM_MEMORY_ATTRIBUTES, or at KVM load time through a module
> parameter.
>
> In-place conversion also implies tracking a guest's private/shared state in
> guest_memfd. To avoid inconsistencies in the way memory attributes are
> tracked between the per-VM or by guest_memfd, make the module_param
> read-only (0444).
>
> Document that using per-VM attributes for tracking private/shared state of
> guest memory is deprecated in favor of tracking in guest_memfd.
>
> Warn if the admin sets gmem_in_place_conversion as false when
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES is not enabled. Add warning in the code
> path where guest memory is populated for a CoCo VM, since that's the
> earliest point in a CoCo VM's lifecycle where memory attributes are
> queried. Unlike other query sites, this site is exclusively used by CoCo
> VMs.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

> ---
>  arch/x86/kvm/Kconfig   | 7 ++++++-
>  virt/kvm/guest_memfd.c | 5 +++++
>  virt/kvm/kvm_main.c    | 3 ++-
>  3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index c28393dc664eb..a3c189d765150 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -85,7 +85,12 @@ config KVM_VM_MEMORY_ATTRIBUTES
>         bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
>         help
>           Enable support for tracking PRIVATE vs. SHARED memory using per-VM
> -         memory attributes.
> +         memory attributes.  Using per-VM attributes are deprecated in favor

nit:
are->is

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad





> +         of tracking PRIVATE state in guest_memfd.  Select this if you need
> +         to run CoCo VMs using a VMM that doesn't support guest_memfd memory
> +         attributes.
> +
> +         If unsure, say N.
>
>  config KVM_SW_PROTECTED_VM
>         bool "Enable support for KVM software-protected VMs"
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 86c9f5b0863cb..5cb73543c03c8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1193,10 +1193,15 @@ static bool kvm_gmem_range_is_private(struct file *file, pgoff_t index,
>  {
>         struct maple_tree *mt = &GMEM_I(file_inode(file))->attributes;
>
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         if (!gmem_in_place_conversion)
>                 return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
>                                                           KVM_MEMORY_ATTRIBUTE_PRIVATE,
>                                                           KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +#else
> +       if (WARN_ON_ONCE(!gmem_in_place_conversion))
> +               return false;
> +#endif
>
>         return kvm_gmem_range_has_attributes(mt, index, nr_pages,
>                                              KVM_MEMORY_ATTRIBUTE_PRIVATE);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dd1d18a1d2f68..46e92b5dc3804 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -102,7 +102,8 @@ static bool __ro_after_init allow_unsafe_mappings;
>  module_param(allow_unsafe_mappings, bool, 0444);
>
>  #ifdef kvm_arch_has_private_mem
> -bool __ro_after_init gmem_in_place_conversion = false;
> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> +module_param(gmem_in_place_conversion, bool, 0444);
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(gmem_in_place_conversion);
>  #endif
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>


^ permalink raw reply

* Re: [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: Shakeel Butt @ 2026-06-24 18:57 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, cgroups, linux-kernel, kernel-team
In-Reply-To: <20260624183700.1152742-1-joshua.hahnjy@gmail.com>

On Wed, Jun 24, 2026 at 11:36:59AM -0700, Joshua Hahn wrote:
> Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
> last caller of for_each_mem_cgroup back in 2021, and there have not been
> any new callers since. Remove the macro.
> 
> A comment in mem_cgroup_css_online has also been out of date since 2021,
> when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
> code") open-coded the for_each_mem_cgroup iterator. Update the comment.
> 
> Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
> statistics and events") added a second declaration for memcg_events to
> include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
> Let's clean that up too.
> 
> No functional changes intended.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Thanks for the cleanup.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply

* [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: Joshua Hahn @ 2026-06-24 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, cgroups, linux-kernel, kernel-team

Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
last caller of for_each_mem_cgroup back in 2021, and there have not been
any new callers since. Remove the macro.

A comment in mem_cgroup_css_online has also been out of date since 2021,
when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
code") open-coded the for_each_mem_cgroup iterator. Update the comment.

Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
statistics and events") added a second declaration for memcg_events to
include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
Let's clean that up too.

No functional changes intended.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
This is intended for the next release cycle. Thank you!

 mm/memcontrol-v1.h | 6 ------
 mm/memcontrol.c    | 2 +-
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index f92f81108d5ed..d3ed5b93290fb 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -17,14 +17,8 @@
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(root, iter, NULL))
 
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 void drain_all_stock(struct mem_cgroup *root_memcg);
 
-unsigned long memcg_events(struct mem_cgroup *memcg, int event);
 int memory_stat_show(struct seq_file *m, void *v);
 
 struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af082326..e171fe36b0711 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4216,7 +4216,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	/*
 	 * A memcg must be visible for expand_shrinker_info()
 	 * by the time the maps are allocated. So, we allocate maps
-	 * here, when for_each_mem_cgroup() can't skip it.
+	 * here, when mem_cgroup_iter() can't skip it.
 	 */
 	if (alloc_shrinker_info(memcg))
 		goto offline_kmem;
-- 
2.53.0-Meta



^ permalink raw reply related

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 18:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <120367a5-0a3c-40ba-a821-f46f8494ef85@linux.dev>

On Wed, 24 Jun 2026 17:43:56 +0100 Usama Arif <usama.arif@linux.dev> wrote:

> 
> 
> On 24/06/2026 16:23, Joshua Hahn wrote:
> > On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> > 
> >> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> > Hello Usama!!
> > 
> > Thank you for reviewing the patch : -)
> > 
> > [...snip...]
> > 
> >>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  			    unsigned int nr_pages)
> >>>  {
> >>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >>>  	int nr_retries = MAX_RECLAIM_RETRIES;
> >>>  	struct mem_cgroup *mem_over_limit;
> >>>  	struct page_counter *counter;
> >>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	bool raised_max_event = false;
> >>>  	unsigned long pflags;
> >>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> >>> +	unsigned long nr_charged = 0;
> >>>  
> >>>  retry:
> >>> -	if (consume_stock(memcg, nr_pages))
> >>> -		return 0;
> >>> -
> >>> -	if (!allow_spinning)
> >>> -		/* Avoid the refill and flush of the older stock */
> >>> -		batch = nr_pages;
> >>> -
> >>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >>>  	if (do_memsw_account() &&
> >>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> >>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> >>> +					   &counter, NULL)) {
> >>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >>>  		goto reclaim;
> >>>  	}
> >>>  
> >>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> >>> -		goto done_restock;
> >>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> >>> +					  &counter, &nr_charged)) {
> >>> +		if (!nr_charged)
> >>> +			return 0;
> >>> +		goto handle_high;
> >>> +	}
> >>>  
> >>>  	if (do_memsw_account())
> >>> -		page_counter_uncharge(&memcg->memsw, batch);
> >>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
> >>
> >> This needs a transactional rollback. page_counter_try_charge_stock() can
> >> succeed by consuming memsw stock and charging 0 new pages, but the
> >> memory-failure path unconditionally uncharges nr_pages from memsw.
> >> That turns a failed allocation into a real memsw usage decrement.
> > 
> > Hmmmmmmmmmm....... I'm not sure.
> > 
> > At this point in the code, we are either (1) using cgroup v1 with memsw
> > and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> > not sure if this really is unconditional, we're just distinguishing
> > between cases (1) and (2) by checking if we're using cgroupv1.
> > 
> > Or is your concern with taking a charge via stock, but uncharging with
> > a hierarchical page_counter walk?
> 
> This was my concern. But I re-read the page_counter stock invariant,
> and the stock-hit case is not an undercount? Consuming stock transfers
> already-charged credit to the pending allocation; if the later memory charge
> fails, page_counter_uncharge() discards that consumed credit from the
> hierarchy. That should keeps usage equal to real charges plus remaining stock?

Yes, stock-hit case just does some math without doing any actual
charging. It's stuff that was pre-charged before, so we're not doing
any undercounting or leaking any charges.

What do you mean by "consumed credit"? From what I can see
page_counter_uncharge --> page_counter_cancel subtracts from
counter->usage, which should be the real charge + hierarchy walk.

Am I missing something :p please feel free to let me know!
Joshua


^ permalink raw reply

* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Yosry Ahmed @ 2026-06-24 18:01 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: akpm, hannes, nphamcs, chengming.zhou, david, ljs, liam, vbabka,
	rppt, surenb, mhocko, kasong, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel
In-Reply-To: <20260624075700.751467-1-alex@ghiti.fr>

On Wed, Jun 24, 2026 at 12:57 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
>
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
>
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
>
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>
> - This bypass originally comes from Usama's series that implements
>   large folio zswapin: while working on improving this series, I noticed
>   the gains I got only came from the bypass of readahead.
>
>  include/linux/zswap.h |  6 ++++++
>  mm/memory.c           |  5 +++--
>  mm/zswap.c            | 11 +++++++++++
>  3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>         return true;
>  }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> +       return false;
> +}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
>         if (!folio) {
> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +                   zswap_present_test(entry))

This assumes that if the swap entry is in zswap, then the remaining
entries (covered by the readahead window) will also be in zswap,
right? While not very likely, it's possible that the remaining entries
not in zswap but on disk, right?

>                         folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>                                             thp_swapin_suitable_orders(vmf) | BIT(0),
>                                             vmf, NULL, 0);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..5b85b4d17647 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>                 >> ZSWAP_ADDRESS_SPACE_SHIFT];
>  }
>
> +/**
> + * zswap_present_test - check if a swap entry is currently backed by zswap
> + * @swp: the swap entry to test
> + *
> + * Return: true if @swp has a zswap entry, false otherwise.
> + */
> +bool zswap_present_test(swp_entry_t swp)

zswap_is_present()?

> +{
> +       return xa_load(swap_zswap_tree(swp), swp_offset(swp));
> +}
> +
>  #define zswap_pool_debug(msg, p)                       \
>         pr_debug("%s pool %s\n", msg, (p)->tfm_name)
>
> --
> 2.54.0
>
>


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-24 17:52 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, torvalds, val, viro, willy
In-Reply-To: <20260624071226.2272209-1-safinaskar@gmail.com>

On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > The CRIU fifo test fails with this change. The problem is that vmsplice
> > with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> >
> > It seems we need a fix like this one:
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index 429b0714ec57..6fc49e933727 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> > file *filp)
> >
> >         /* We can only do regular read/write on fifos */
> >         stream_open(inode, filp);
> > +       filp->f_mode |= FMODE_NOWAIT;
> >
> >         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
> >         case FMODE_READ:
>
> Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> named fifos? Or this is merely a test?

Yes, it does.

>
> If this is just a test, I think we need not to preserve this behavior.
>
> I did debian code search with regex "vmsplice.*SPLICE_F_NONBLOCK" and I
> found very few packages. And it seems all them use pipes, not named fifos.

In short, this isn't how such cases are handled in the kernel. The fix is
simple and should be applied to avoid breaking random software.

>
> (On speed: I still think that my vmsplice patches are good thing,
> despite performance regressions in CRIU.)

I already explained that this isn't just a perfomance degradation, it
actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
our parasite code within the context of a user process, where execution
speed is critical. A heavy performance penalty completely invalidates
the pre-dump logic, making the feature useless.

Under normal circumstances, patches that cause this kind of breakage
would never be merged. However, since there are exceptions to every
rule, we should let the maintainers decide how to proceed here. In CRIU,
we have a backup plan to utilize process_vm_readv to dump process
memory. We already support this mode, but it isn't the default due to
performance concerns. If these patches are merged, it will be the
only option left for CRIU to implement pre-dumping.

However, we need to look at this case in a broader context. This is yet
another example where the change introduces a workflow breakage, meaning
there might be other workloads out there that could be broken by this
change.

At a minimum, we may need to consider a deprecation plan where vmsplice
with SPLICE_F_GIFT triggers a warning for a few releases before these
changes are applied. Alternatively, we could introduce the proposed
behavior alongside a sysctl to fall back to the old behavior and explicitly
state that this fallback path will be completely deprecated in a future kernel
version.

Thanks,
Andrei


^ permalink raw reply

* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-06-24 17:46 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jun 19, 2026, Fuad Tabba wrote:
>> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Ackerley Tng <ackerleytng@google.com>
>> >
>> > When memory in guest_memfd is converted from private to shared, the
>> > platform-specific state associated with the guest-private pages must be
>> > invalidated or cleaned up.
>> >
>> > Iterate over the folios in the affected range and call the
>> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> > architectures to perform necessary teardown, such as updating hardware
>> > metadata or encryption states, before the pages are transitioned to the
>> > shared state.
>> >
>> > Invoke this helper after indicating to KVM's mmu code that an invalidation
>> > is in progress to stop in-flight page faults from succeeding.
>> >
>> > Reviewed-by: Fuad Tabba <tabba@google.com>
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> Coming back to this after working through the arm64/pKVM side. My
>> Reviewed-by here is from the previous round and the patch hasn't
>> changed, but I missed an implication for arm64.
>>
>> kvm_arch_gmem_invalidate() is now called from two paths with the same
>> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
>> private->shared conversion (here). For SNP/TDX that's fine, conversion is
>> destructive anyway. For pKVM the two need opposite content semantics:
>> conversion must preserve the page in place (same physical page, the point
>> of in-place conversion without encryption), while teardown must scrub it
>> before returning it to the host.
>>
>> The hook gets only a pfn range with no indication of which caller it's
>> serving, so arm64 can't give the two paths the behaviour they need. It
>> would help to signal intent on the conversion path: a reason/flag, a
>> separate hook, or not routing non-destructive conversion through the
>> teardown hook.
>>
>> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
>> second caller now, and it's cheaper to leave room for the distinction
>> than to change a generic contract other arches depend on later.
>
> Crud.  It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now.  More below.
>
>> >  virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 41 insertions(+)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index 433f79047b9d1..3c94442bc8131 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> >         return safe;
>> >  }
>> >
>> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed.  It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly.  And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations.  For pKVM, it may not even need to
> do anything invalidation-like.
>

Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
especially when conversions also calls
kvm_gmem_invalidate_{start,end}() and those do different things.

> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag?  Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
>> > +{
>> > +       struct folio_batch fbatch;
>> > +       pgoff_t next = start;
>> > +       int i;
>> > +
>> > +       folio_batch_init(&fbatch);
>> > +       while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> > +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> > +                       struct folio *folio = fbatch.folios[i];
>> > +                       pgoff_t start_index, end_index;
>> > +                       kvm_pfn_t start_pfn, end_pfn;
>> > +
>> > +                       start_index = max(start, folio->index);
>> > +                       end_index = min(end, folio_next_index(folio));
>> > +                       /*
>> > +                        * end_index is either in folio or points to
>> > +                        * the first page of the next folio. Hence,
>> > +                        * all pages in range [start_index, end_index)
>> > +                        * are contiguous.
>> > +                        */
>> > +                       start_pfn = folio_file_pfn(folio, start_index);
>> > +                       end_pfn = start_pfn + end_index - start_index;
>> > +
>> > +                       kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> > +               }
>> > +
>> > +               folio_batch_release(&fbatch);
>> > +               cond_resched();
>> > +       }
>> > +}
>> > +#else
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> > +#endif
>> > +
>> >  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> >                                      size_t nr_pages, uint64_t attrs,
>> >                                      pgoff_t *err_index)
>> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> >          */
>> >
>> >         kvm_gmem_invalidate_start(inode, start, end);
>> > +
>> > +       if (!to_private)
>> > +               kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
> 	kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case.  So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?
>
> 	if (!to_private)
> 		kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config.  It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).
>

pKVM needs some arch guest_memfd lifecycle functions that

+ for conversion, doesn't do anything,
+ for teardown, resets page state (IIUC it'll be reset to
  PKVM_PAGE_OWNED (by the host))

So I think we need different functions for those two stages in the
lifecycle of a page with guest_memfd? What if we have

CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates

+ kvm_gmem_should_set_pfn_attributes(attributes) and
  .gmem_should_set_pfn_attributes
+ kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
  .gmem_set_pfn_attributes

CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates

+ kvm_gmem_teardown() and .gmem_teardown

SNP:

+ .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
  and sev_gmem_should_set_pfn_attributes returns !is_private
+ Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
+ .gmem_teardown = sev_gmem_set_pfn_attributes

TDX:

+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN

pKVM:

+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ .gmem_teardown = pkvm_gmem_set_pfn_attributes

Suzuki, does this work for ARM CCA?

This way,

+ The if (is_private) check doesn't leak SNP details into guest_memfd
+ .gmem_make_shared doesn't stick out without a .gmem_make_private
+ .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
  conceptually as lifecycle hooks

+ I think the private/shared check for prepare can also be folded into
  preparation.
    + Preparation perhaps doesn't need a should_prepare equivalent since
      there's no iteration and getting the gfn is just doing some math?
    + In another patch series?

> E.g. this?  There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>         return safe;
>  }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
>  {
>         struct folio_batch fbatch;
>         pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>         }
>  }
>  #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
>  #endif
>
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>         kvm_gmem_invalidate_start(inode, start, end);
>
>         if (!to_private)
> -               kvm_gmem_invalidate(inode, start, end);
> +               kvm_gmem_make_shared(inode, start, end);
>
>         mas_store_prealloc(&mas, xa_mk_value(attrs));


^ permalink raw reply

* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Nhat Pham @ 2026-06-24 17:43 UTC (permalink / raw)
  To: Kairui Song
  Cc: Alexandre Ghiti, akpm, hannes, yosry, chengming.zhou, david, ljs,
	liam, vbabka, rppt, surenb, mhocko, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel
In-Reply-To: <CAMgjq7CfMp_7bhDWirUmxM0pFzk6d9in9h6wuHsMoeUu-+TC_Q@mail.gmail.com>

On Wed, Jun 24, 2026 at 3:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Better check zswap_never_enabled first to avoid a xa_load if not needed.

+1.

Maybe also xa_empty() when we're at it? :)


^ permalink raw reply

* [REGRESSION] mm/mprotect: shared-dirty base-page toggle slower since v6.17
From: Chengfeng Lin @ 2026-06-24 17:28 UTC (permalink / raw)
  To: Pedro Falcato, Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	linux-kernel, regressions

Hi,

I have a refreshed bare-metal result for the shared-dirty mprotect()
slowdown I reported earlier from QEMU/lab testing.

The reproducer is intentionally narrow:

  - MAP_SHARED | MAP_ANONYMOUS mapping
  - 64 MiB range, write-prefaulted before timing
  - state check: 4 KiB base pages, no THP backing
  - repeated full-range mprotect(PROT_READ)
  - restore with mprotect(PROT_READ | PROT_WRITE)
  - write-touch after each protect/restore cycle

So this is not a generic mprotect() regression claim.  The scope is the
shared-dirty base-page PTE permission-change path.

The bare-metal machine is an Intel Core i7-14700 system.  The workload is
single-threaded and pinned to one logical CPU with `taskset -c 2`.  The primary
metric is `iteration_ns_per_page`, lower is better.  It is the wall-clock time
for one full protect/restore/write-touch iteration, divided by the number of
4 KiB pages in the range.  Each benchmark step used 9 external rounds, 1000
iterations, and 10 warmup iterations.

First, the v6.12 -> v6.19 result still reproduces on bare metal:

  kernel                         iteration_ns_per_page
  v6.12.77                       26
  v6.19.9                        37

I then narrowed the release window with 3 interleaved boot/run steps per
kernel:

  kernel                         values          mean
  v6.16                          25 25 25        25.000
  v6.17                          37 37 37        37.000
  v6.18                          38 38 38        38.000
  v6.18.19                       38 38 38        38.000
  v6.19.9                        37 36 37        36.667

I also checked later context with the same standalone command:

  kernel                         values          mean
  v7.0.9                         36 36 36        36.000
  v6.19.9 + Pedro v3 patch-only  39 39 39        39.000
  v7.1.0-rc3 mm-unstable/Pedro   39 39 39        39.000

I do not treat the mm-unstable result as a clean release-kernel comparison.
It is only a follow-up check, and in this workload it did not improve the
standalone result.

All of these runs reported `expected_match_ratio=100` and
`unexpected_results=0`.  The state check in the standalone output stays in the
same shape: 4 KiB pages, no THP.

This puts the slowdown in the v6.16 -> v6.17 release window.

As an attribution check, I also built a v6.17 probe kernel that only changes
the present-PTE path in `mm/mprotect.c::change_pte_range()` for this workload
back to a single-PTE start/commit/flush shape.  That is not an upstream patch
and not a clean release-kernel comparison; it is only a hot-path probe.

The result was:

  kernel                         values          mean
  v6.16                          25 25 25        25.000
  v6.17                          37 37 37        37.000
  v6.17 single-PTE probe         25 25 25        25.000

So the targeted probe brings v6.17 back to the v6.16 range for this workload.
That points at the v6.17 PTE-batching shape in `change_pte_range()` as the
main cost for this shared-dirty 4 KiB base-page case.

I do not want to overstate the attribution.  I tried reversing the official
`cac1db8c3aad ("mm: optimize mprotect() by PTE batching")` patch onto my
linux-6.17 tree, but it did not apply cleanly.  That means this is not an
exact revert result.  I can only say that the slowdown appears in the
v6.16 -> v6.17 window, and that this focused probe brings the v6.17 result
back to the v6.16 range.

Evidence bundle:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle

Standalone reproducer:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/reproducer

For each installed kernel, the standalone reproducer was run as:

  taskset -c 2 env MAPPING_MB=64 ITERATIONS=1000 WARMUP=10 \
    EXTERNAL_ROUNDS=9 ./run_mprotect_shared_dirty_reproducer.sh

For the release-window check, a small systemd/GRUB queue booted each target
kernel before running the same command.

Bare-metal summaries and raw run logs:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal

Release-window narrowing:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260623-narrow-6.16-6.19-3rounds

v6.17 single-PTE probe:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260624-6.17-singlepte-probe

Probe patch used for that attribution run:

  https://github.com/lcf0399/linux-mm-regression-evidence/blob/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260624-6.17-singlepte-probe/0001-mm-mprotect-probe-6.17-single-pte-hotpath.patch

#regzbot introduced: v6.16..v6.17

Does this scope look useful to investigate further?  If yes, I can try a more
exact commit-level check or test a patch you think is the right direction.

Thanks,
Chengfeng


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 17:01 UTC (permalink / raw)
  To: Binbin Wu
  Cc: ackerleytng, aik, andrew.jones, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <6fc7f450-6d0a-494d-b295-297e4703148d@linux.intel.com>

On Tue, Jun 23, 2026, Binbin Wu wrote:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> > @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> >  	next = start;
> >  	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> >  
> > -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > +		for (i = 0; i < folio_batch_count(&fbatch);) {
> >  			struct folio *folio = fbatch.folios[i];
> >  
> > -			if (folio_ref_count(folio) !=
> > -			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> > -				safe = false;
> > +			safe = (folio_ref_count(folio) ==
> > +				folio_nr_pages(folio) +
> > +				filemap_get_folios_refcount);
> > +
> > +			if (safe) {
> > +				++i;
> > +			} else if (folio_may_be_lru_cached(folio) &&
> > +				   !lru_drained) {
> > +				lru_add_drain_all();
> 
> It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
> by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could lead to DoS risk?

FIW, if there's a risk, then AFAICT fadvise() and memfd's F_ADD_SEALS already
have the same risk.


^ permalink raw reply

* Re: [PATCH v4 2/5] mm/zswap: Factor writeback loop out of shrink_worker()
From: Yosry Ahmed @ 2026-06-24 17:00 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <0916e673-861f-b472-7417-afbffbcc98ad@gmail.com>

On Wed, Jun 24, 2026 at 4:55 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
>
>
> On 2026/6/23 07:36, Yosry Ahmed wrote:
> >> +/*
> >> + * Walk the memcg tree and write back zswap pages until the
> >> + * (lower_pages, upper_pages) window closes, or abort encounter
> >> + * MAX_RECLAIM_RETRIES times of the following conditions:
> >> + * - No writeback-candidate memcgs found in a memcg tree walk.
> >> + * - Shrinking a writeback-candidate memcg failed.
> >> + *
> >> + * For shrink_worker(), it passes lower=thr and upper=zswap_total_pages().
> >> + * The @upper limit is refreshed in each iteration by re-evaluating
> >> + * zswap_total_pages(), and the window closes once the total falls
> >> + * below the threshold.
> >
> > This is the wrong abstraction level, and it's obvious by the fact that
> > the function calls zswap_total_pages() again to recalcualte
> > 'upper_pages'. It gets much worse in the next patch as well.
> >
> > The lower_pages and upper_pages thing is also unnecessarily hard to
> > follow.
> >
> > The core of the reuse here is the retry logic. So maybe keep the memcg
> > iteration in the callers, and define a function that takes in one memcg
> > and reclaims one batch from it? failures and attempts can be passed into
> > the function to maintain the state across scans of different memcgs,
> > like zswap_shrink_walk_arg?
> >
> > WDYT?
>
>
> Perhaps something like this?
>
> struct zswap_shrink_state {
>      int attempts;
>      int failures;
>      bool stop;
> };
>
> static bool zswap_shrink_no_candidate(struct zswap_shrink_state *s)
> {
>      if (!s->attempts && ++s->failures == MAX_RECLAIM_RETRIES)
>          return true;
>
>      s->attempts = 0;
>      return false;
> }
>
> static long zswap_shrink_one(struct mem_cgroup *memcg,
>                   struct zswap_shrink_state *s)
> {
>      long shrunk;
>
>      shrunk = shrink_memcg(memcg, NR_ZSWAP_WB_BATCH);
>      if (shrunk == -ENOENT)
>          return 0;
>
>      s->attempts++;
>      if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
>          s->stop = true;

Do we need 'stop' or can we just return a value here to indicate that
we should stop (e.g. -EBUSY)?

>
>      return shrunk;
> }
>
> static void shrink_worker(struct work_struct *w)
> {
>      struct zswap_shrink_state s = {};
>      unsigned long thr;
>
>      /* Reclaim down to the accept threshold */
>      thr = zswap_accept_thr_pages();
>
>      while (zswap_total_pages() > thr) {
>          struct mem_cgroup *memcg;
>
>          cond_resched();
>
>          memcg = zswap_iter_global();
>          if (!memcg) {
>              if (zswap_shrink_no_candidate(&s))
>                  break;
>              continue;
>          }
>
>          zswap_shrink_one(memcg, &s);
>          /* Drop the extra reference taken by the iterator. */
>          mem_cgroup_put(memcg);
>          if (s.stop)
>              break;
>      }
> }
>
> We could also fold the logic of zswap_shrink_no_candidate() into
> zswap_shrink_one(), but adding a !memcg check inside zswap_shrink_one()
> feels a bit awkward.
>
> WDYT?

I think splitting the shrink/retry logic over 2 functions makes it
more difficult to follow, so yeah I think fold
zswap_shrink_no_candidate() into zswap_shrink_one(). Then the callers
only need to iterate memcgs (depending on the context) and call
zswap_shrink_one() for each of them.


^ permalink raw reply

* Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
From: Yosry Ahmed @ 2026-06-24 16:57 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <057ea303-4c27-1a6e-08de-cce26c699097@gmail.com>

>
> /*
>   * Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
>   * and write back the reclaimable ones.
>   *
>   * Since the second-chance algorithm rotates referenced entries to the
>   * LRU tail, the per-node scan is capped at the current LRU length so
>   * each entry is scanned at most once per call. It is up to the caller
>   * to handle retries, deciding whether to scan the next memcg to complete

Nit: "whether to scan another memcg to complete.."

>   * the full iteration, or to rescan the current memcg to drain its zswap
>   * entries.
>   *
>   * Return: The number of compressed bytes written back (>= 0), or -ENOENT
>   * if @memcg has writeback disabled, is a zombie cgroup, or has empty
>   * zswap LRUs.
>   */
> static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
> {
>      struct zswap_shrink_walk_arg walk_arg = {
>          .bytes_written = 0,
>          .encountered_page_in_swapcache = false,
>      };
>      unsigned long nr_remaining = nr_to_scan;
>      int nid;
>
>      if (!mem_cgroup_zswap_writeback_enabled(memcg))
>          return -ENOENT;
>
>      /*
>       * Skip zombies because their LRUs are reparented and we would be
>       * reclaiming from the parent instead of the dead memcg.
>       */
>      if (memcg && !mem_cgroup_online(memcg))
>          return -ENOENT;
>
>      for_each_node_state(nid, N_NORMAL_MEMORY) {
>          unsigned long nr_to_walk;
>
>          /*
>           * Cap the walk at the current LRU length to ensure each entry is
>           * scanned at most once per call. Referenced entries are rotated
>           * to the tail for a second chance, and this bound prevents them
>           * from being revisited within a single call. Retries are left to
>           * the caller, which can choose to rescan the current memcg or
>           * move on to the next one.
>           */

Nit: Make this more concise since it's already explained above.

Otherwise this looks good to me, thank you!

>          nr_to_walk = min(nr_remaining,
>                   list_lru_count_one(&zswap_list_lru, nid, memcg));
>          if (!nr_to_walk)
>              continue;
>
>          nr_remaining -= nr_to_walk;
>          list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
>                    &walk_arg, &nr_to_walk);
>          /* Return the unused share of the budget to the pool. */
>          nr_remaining += nr_to_walk;
>
>          if (!nr_remaining)
>              break;
>      }
>
>      /* Nothing was scanned: every LRU under @memcg was empty. */
>      if (nr_remaining == nr_to_scan)
>          return -ENOENT;
>
>      return walk_arg.bytes_written;
> }
>
>
> Thanks,
> Hao


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 16:57 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>

On Thu, Jun 18, 2026, Ackerley Tng wrote:
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. 

Under what circumstances does this happen, and what alternatives are there for
userspace to work around the issue?


^ permalink raw reply

* Re: [PATCH v2 13/13] mm: remove __GFP_NO_CODETAG
From: Suren Baghdasaryan @ 2026-06-24 16:47 UTC (permalink / raw)
  To: Hao Ge
  Cc: Brendan Jackman, Vlastimil Babka, Harry Yoo (Oracle),
	Gregory Price, Alexei Starovoitov, Matthew Wilcox, linux-mm,
	linux-kernel, linux-rt-devel, Michal Hocko, Andrew Morton,
	Johannes Weiner, Zi Yan, Muchun Song, Oscar Salvador,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Alistair Popple, Ying Huang, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt
In-Reply-To: <6e312b15-d2b5-4137-aa3f-720ec214c7ab@linux.dev>

On Tue, Jun 23, 2026 at 12:57 AM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Brendan
>
>
> On 2026/6/22 18:01, Brendan Jackman wrote:
> > Now that alloc_pages has an entrypoint that allows passing alloc_flags,
> > we can take advantage of this to start removing GFP flags that are only
> > used for mm-internal stuff.
> >
> > This requires also plumbing the alloc_flags into some more of the
> > allocator code, in particular __alloc_pages[_noprof]() gets an
> > alloc_flags arg to go along with its callees, and we now need to pass
> > those flags deeper into the allocator so they can reach the alloc_tag
> > code.
> >
> > To try and keep the new ALLOC_NO_CODETAG's scope nice and narrow, don't
> > define it in mm/internal.h, instead just define a "reserved bit" and
> > then use that in places that don't care about what it means.

I don't understand why you want to narrow down visibility of one of
the alloc_flag bits. We don't do that for any other flags, and this
seems like an unnecessary complexity.

> >
> > Signed-off-by: Brendan Jackman <jackmanb@google.com>
>
>
> Nit: The title says "remove __GFP_NO_CODETAG" but the flag isn't really
> removed — it's migrated from gfp_t to alloc_flags as
>
> ALLOC_NO_CODETAG. Something like "mm: replace __GFP_NO_CODETAG with an
> alloc_flag" would be more accurate.
>
>
> Additionally, as Lorenzo pointed out in another thread, you will likely
> need to rebase this series later.
>
> I noticed Vlastimil has already landed the slab changes removing
> __GFP_NO_OBJ_EXT into mainline:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=335c347686e76df9d2c7d7f61b5ea627a4c5cb4c
>
> For v3, it might make sense to fold in Vlastimil's patch so the full
> removal of __GFP_NO_OBJ_EXT can be completed end-to-end
>
> https://lore.kernel.org/all/20260609-slab_alloc_flags-v1-15-2bf4a4b9b526@kernel.org/

I think Vlastimil's patch will be merged before this one, so this
patch could remove __GFP_NO_OBJ_EXT complely, saying that its last
user (__GFP_NO_CODETAG) is gone.

>
>
> > ---
> >   mm/alloc_tag.c       | 18 ++++++++++--------
> >   mm/compaction.c      |  4 ++--
> >   mm/internal.h        |  8 ++++++--
> >   mm/page_alloc.c      | 42 ++++++++++++++++++++++++------------------
> >   mm/page_frag_cache.c |  4 ++--
> >   5 files changed, 44 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/alloc_tag.c b/mm/alloc_tag.c
> > index d9be1cf5187d9..61a6cba32ff35 100644
> > --- a/mm/alloc_tag.c
> > +++ b/mm/alloc_tag.c
> > @@ -15,6 +15,8 @@
> >   #include <linux/vmalloc.h>
> >   #include <linux/kmemleak.h>
> >
> > +#include "internal.h"
> > +
> >   #define ALLOCINFO_FILE_NAME         "allocinfo"
> >   #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
> >   #define SECTION_START(NAME)         (CODETAG_SECTION_START_PREFIX NAME)
> > @@ -785,16 +787,15 @@ struct pfn_pool {
> >                                        sizeof(unsigned long))
> >
> >   /*
> > - * Skip early PFN recording for a page allocation.  Reuses the
> > - * %__GFP_NO_OBJ_EXT bit.  Used by __alloc_tag_add_early_pfn() to avoid
> > - * recursion when allocating pages for the early PFN tracking list
> > - * itself.
> > + * Skip early PFN recording for a page allocation.  Used by
> > + * __alloc_tag_add_early_pfn() to avoid recursion when allocating pages for the
> > + * early PFN tracking list itself.
> >    *
> >    * Codetags of the pages allocated with __GFP_NO_CODETAG should be
> >    * cleared (via clear_page_tag_ref()) before freeing the pages to prevent
> >    * alloc_tag_sub_check() from triggering a warning.
> >    */
> > -#define __GFP_NO_CODETAG             __GFP_NO_OBJ_EXT
> > +#define ALLOC_NO_CODETAG             __ALLOC_ALLOC_TAG
> >
> >   static struct pfn_pool *current_pfn_pool __initdata;
> >
> > @@ -806,7 +807,8 @@ static void __init __alloc_tag_add_early_pfn(unsigned long pfn)
> >       do {
> >               pool = READ_ONCE(current_pfn_pool);
> >               if (!pool || atomic_read(&pool->count) >= PFN_POOL_SIZE) {
> > -                     struct page *new_page = alloc_page(__GFP_HIGH | __GFP_NO_CODETAG);
> > +                     struct page *new_page = __alloc_pages(__GFP_HIGH, 0, numa_mem_id(),
> > +                                                           NULL, ALLOC_NO_CODETAG);
> >                       struct pfn_pool *new;
> >
> >                       if (!new_page) {
> > @@ -837,7 +839,7 @@ typedef void alloc_tag_add_func(unsigned long pfn);
> >   static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata =
> >       RCU_INITIALIZER(__alloc_tag_add_early_pfn);
> >
> > -void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> > +void alloc_tag_add_early_pfn(unsigned long pfn, unsigned int alloc_flags)
>
>
> alloc_tag_add_early_pfn is actually declared in include/linux/alloc_tag.h,
>
> so we need to update this header in sync as well.
>
> include/linux/alloc_tag.h:166:void alloc_tag_add_early_pfn(unsigned long
> pfn, gfp_t gfp_flags);
> include/linux/alloc_tag.h:170:static inline void
> alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags) {}
>
>
> >   {
> >       alloc_tag_add_func *alloc_tag_add;
> >
> > @@ -845,7 +847,7 @@ void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> >               return;
> >
> >       /* Skip allocations for the tracking list itself to avoid recursion. */
> > -     if (gfp_flags & __GFP_NO_CODETAG)
> > +     if (alloc_flags & ALLOC_NO_CODETAG)
> >               return;
> >
> >       rcu_read_lock();
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index b776f35ad0200..e90ebd2c54f48 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
> >
> >   static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
> >   {
> > -     post_alloc_hook(page, order, __GFP_MOVABLE);
> > +     post_alloc_hook(page, order, __GFP_MOVABLE, ALLOC_DEFAULT);
> >       set_page_refcounted(page);
> >       return page;
> >   }
> > @@ -1850,7 +1850,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
> >       }
> >       dst = (struct folio *)freepage;
> >
> > -     post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> > +     post_alloc_hook(&dst->page, order, __GFP_MOVABLE, ALLOC_DEFAULT);
> >       set_page_refcounted(&dst->page);
> >       if (order)
> >               prep_compound_page(&dst->page, order);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 0847b55bfc147..a45bedb9ada5f 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -684,6 +684,8 @@ struct alloc_context {
> >        */
> >       enum zone_type highest_zoneidx;
> >       bool spread_dirty_pages;
> > +     /* Only flags that are global to the whole allocation go here. */
> > +     unsigned int alloc_flags;
> >   };
> >
> >   /*
> > @@ -907,7 +909,8 @@ static inline void init_compound_tail(struct page *tail,
> >       prep_compound_tail(tail, head, order);
> >   }
> >
> > -void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> > +void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > +                  unsigned int alloc_flags);
> >   extern bool free_pages_prepare(struct page *page, unsigned int order);
> >
> >   extern int user_min_free_kbytes;
> > @@ -1481,6 +1484,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> >   #define ALLOC_HIGHATOMIC    0x200 /* Allows access to MIGRATE_HIGHATOMIC */
> >   #define ALLOC_NOLOCK                0x400 /* Only use spin_trylock in allocation path */
> >   #define ALLOC_KSWAPD                0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
> > +#define __ALLOC_ALLOC_TAG      0x1000 /* Reserved bit for use by alloc_tag code */
> >
> >   /* Flags that allow allocations below the min watermark. */
> >   #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
> > @@ -1956,7 +1960,7 @@ bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
> >                  unsigned long npages);
> >
> >   struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> > -             nodemask_t *nodemask);
> > +             nodemask_t *nodemask, unsigned int alloc_flags);
> >   #define __alloc_pages(...)                  alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
> >
> >   #endif      /* __MM_INTERNAL_H */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d99e4ea8307ea..d50fd9c77a2e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1246,7 +1246,7 @@ void __clear_page_tag_ref(struct page *page)
> >   /* Should be called only if mem_alloc_profiling_enabled() */
> >   static noinline
> >   void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> > -                    unsigned int nr, gfp_t gfp_flags)
> > +                    unsigned int nr, unsigned int alloc_flags)
> >   {
> >       union pgtag_ref_handle handle;
> >       union codetag_ref ref;
> > @@ -1260,17 +1260,17 @@ void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> >                * page_ext is not available yet, record the pfn so we can
> >                * clear the tag ref later when page_ext is initialized.
> >                */
> > -             alloc_tag_add_early_pfn(page_to_pfn(page), gfp_flags);
> > +             alloc_tag_add_early_pfn(page_to_pfn(page), alloc_flags);
> >               if (task->alloc_tag)
> >                       alloc_tag_set_inaccurate(task->alloc_tag);
> >       }
> >   }
> >
> >   static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> > -                                unsigned int nr, gfp_t gfp_flags)
> > +                                unsigned int nr, unsigned int alloc_flags)
>
>
> The pgalloc_tag_add() stub in the non-CONFIG_MEM_ALLOC_PROFILING build
> could use the same parameter types for consistency

Umm, correction. It *should* use the same parameter type. It's
unfortunate that the compiler doesn't catch this...

>
>
> Thanks
>
> Best Regards
>
> Hao
>
>
> >   {
> >       if (mem_alloc_profiling_enabled())
> > -             __pgalloc_tag_add(page, task, nr, gfp_flags);
> > +             __pgalloc_tag_add(page, task, nr, alloc_flags);
> >   }
> >
> >   /* Should be called only if mem_alloc_profiling_enabled() */
> > @@ -1807,7 +1807,7 @@ static inline bool should_skip_init(gfp_t flags)
> >   }
> >
> >   inline void post_alloc_hook(struct page *page, unsigned int order,
> > -                             gfp_t gfp_flags)
> > +                             gfp_t gfp_flags, unsigned int alloc_flags)
> >   {
> >       const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
> >       bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
> > @@ -1858,13 +1858,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >
> >       set_page_owner(page, order, gfp_flags);
> >       page_table_check_alloc(page, order);
> > -     pgalloc_tag_add(page, current, 1 << order, gfp_flags);
> > +     pgalloc_tag_add(page, current, 1 << order, alloc_flags);
> >   }
> >
> >   static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >                                                       unsigned int alloc_flags)
> >   {
> > -     post_alloc_hook(page, order, gfp_flags);
> > +     post_alloc_hook(page, order, gfp_flags, alloc_flags);
> >
> >       if (order && (gfp_flags & __GFP_COMP))
> >               prep_compound_page(page, order);
> > @@ -4773,8 +4773,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >        * The fast path uses conservative alloc_flags to succeed only until
> >        * kswapd needs to be woken up, and to avoid the cost of setting up
> >        * alloc_flags precisely. So we do that now.
> > +      *
> > +      * Can't just or alloc_flags if it contains WMARK bits, but those flags
> > +      * shouldn't be set in ac->alloc_flags.
> >        */
> > -     alloc_flags = slowpath_alloc_flags(gfp_mask, order);
> > +     VM_WARN_ON(ac->alloc_flags & ALLOC_WMARK_MASK);
> > +     alloc_flags = ac->alloc_flags | slowpath_alloc_flags(gfp_mask, order);
> >
> >       /*
> >        * We need to recalculate the starting point for the zonelist iterator
> > @@ -4816,7 +4820,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >       reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> >       if (reserve_flags)
> >               alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags) |
> > -                                       (alloc_flags & ALLOC_KSWAPD);
> > +                             ac->alloc_flags | (alloc_flags & ALLOC_KSWAPD);
> >
> >       /*
> >        * Reset the nodemask and zonelist iterators if memory policies can be
> > @@ -5218,7 +5222,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >       return nr_populated;
> >
> >   failed:
> > -     page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
> > +     page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask, ALLOC_DEFAULT);
> >       if (page)
> >               page_array[nr_populated++] = page;
> >       goto out;
> > @@ -5326,11 +5330,13 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> >   {
> >       struct page *page;
> >       gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> > -     struct alloc_context ac = { };
> > +     struct alloc_context ac = {
> > +             .alloc_flags = alloc_flags,
> > +     };
> >       unsigned int fastpath_alloc_flags = alloc_flags;
> >
> >       /* Other flags could be supported later if needed. */
> > -     if (WARN_ON(alloc_flags & ~ALLOC_NOLOCK))
> > +     if (WARN_ON(alloc_flags & ~(ALLOC_NOLOCK | __ALLOC_ALLOC_TAG)))
> >               return NULL;
> >
> >       if (!alloc_order_allowed(gfp, order, alloc_flags))
> > @@ -5398,12 +5404,12 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> >   EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
> >
> >   struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> > -             int preferred_nid, nodemask_t *nodemask)
> > +             int preferred_nid, nodemask_t *nodemask, unsigned int alloc_flags)
> >   {
> >       struct page *page;
> >
> >       page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask,
> > -                                        ALLOC_DEFAULT);
> > +                                        alloc_flags);
> >       if (page)
> >               set_page_refcounted(page);
> >       return page;
> > @@ -5418,14 +5424,14 @@ struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order
> >       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> >       warn_if_node_offline(nid, gfp_mask);
> >
> > -     return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
> > +     return __alloc_pages_noprof(gfp_mask, order, nid, NULL, ALLOC_DEFAULT);
> >   }
> >
> >   struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> >               nodemask_t *nodemask)
> >   {
> >       struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> > -                                     preferred_nid, nodemask);
> > +                                     preferred_nid, nodemask, ALLOC_DEFAULT);
> >       return page_rmappable_folio(page);
> >   }
> >   EXPORT_SYMBOL(__folio_alloc_noprof);
> > @@ -7107,7 +7113,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
> >               list_for_each_entry_safe(page, next, &list[order], lru) {
> >                       int i;
> >
> > -                     post_alloc_hook(page, order, gfp_mask);
> > +                     post_alloc_hook(page, order, gfp_mask, ALLOC_DEFAULT);
> >                       if (!order)
> >                               continue;
> >
> > @@ -7312,7 +7318,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
> >               struct page *head = pfn_to_page(start);
> >
> >               check_new_pages(head, order);
> > -             prep_new_page(head, order, gfp_mask, 0);
> > +             prep_new_page(head, order, gfp_mask, ALLOC_DEFAULT);
> >       } else {
> >               ret = -EINVAL;
> >               WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e4..d9573170e0719 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -57,10 +57,10 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >       gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> >                  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >       page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> > -                          numa_mem_id(), NULL);
> > +                          numa_mem_id(), NULL, ALLOC_DEFAULT);
> >   #endif
> >       if (unlikely(!page)) {
> > -             page = __alloc_pages(gfp, 0, numa_mem_id(), NULL);
> > +             page = __alloc_pages(gfp, 0, numa_mem_id(), NULL, ALLOC_DEFAULT);
> >               order = 0;
> >       }
> >
> >


^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 16:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624152331.2228828-1-joshua.hahnjy@gmail.com>



On 24/06/2026 16:23, Joshua Hahn wrote:
> On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> Hello Usama!!
> 
> Thank you for reviewing the patch : -)
> 
> [...snip...]
> 
>>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  			    unsigned int nr_pages)
>>>  {
>>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>>>  	int nr_retries = MAX_RECLAIM_RETRIES;
>>>  	struct mem_cgroup *mem_over_limit;
>>>  	struct page_counter *counter;
>>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	bool raised_max_event = false;
>>>  	unsigned long pflags;
>>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
>>> +	unsigned long nr_charged = 0;
>>>  
>>>  retry:
>>> -	if (consume_stock(memcg, nr_pages))
>>> -		return 0;
>>> -
>>> -	if (!allow_spinning)
>>> -		/* Avoid the refill and flush of the older stock */
>>> -		batch = nr_pages;
>>> -
>>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>>>  	if (do_memsw_account() &&
>>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
>>> +					   &counter, NULL)) {
>>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>>>  		goto reclaim;
>>>  	}
>>>  
>>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
>>> -		goto done_restock;
>>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
>>> +					  &counter, &nr_charged)) {
>>> +		if (!nr_charged)
>>> +			return 0;
>>> +		goto handle_high;
>>> +	}
>>>  
>>>  	if (do_memsw_account())
>>> -		page_counter_uncharge(&memcg->memsw, batch);
>>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>>
>> This needs a transactional rollback. page_counter_try_charge_stock() can
>> succeed by consuming memsw stock and charging 0 new pages, but the
>> memory-failure path unconditionally uncharges nr_pages from memsw.
>> That turns a failed allocation into a real memsw usage decrement.
> 
> Hmmmmmmmmmm....... I'm not sure.
> 
> At this point in the code, we are either (1) using cgroup v1 with memsw
> and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> not sure if this really is unconditional, we're just distinguishing
> between cases (1) and (2) by checking if we're using cgroupv1.
> 
> Or is your concern with taking a charge via stock, but uncharging with
> a hierarchical page_counter walk?

This was my concern. But I re-read the page_counter stock invariant,
and the stock-hit case is not an undercount? Consuming stock transfers
already-charged credit to the pending allocation; if the later memory charge
fails, page_counter_uncharge() discards that consumed credit from the
hierarchy. That should keeps usage equal to real charges plus remaining stock?

> If so, I think there's a case to be
> made here with just simply returning the stock. I just wanted to keep
> it consistent with the original memcontrol code, which only used
> stock to fulfill charges, not uncharges, since this could make the
> stock grow without bound.
> 
> What do you think? Thanks again for reviewing Usama, I hope you have a
> great day!!!
> Joshua



^ permalink raw reply

* Re: [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
From: Gupta, Pankaj @ 2026-06-24 16:41 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-5-gourry@gourry.net>


> Existing callers of add_memory_driver_managed cannot select the
> preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), requiring it to
> hot-add memory as offline blocks, and then follow up by onlining each
> memory block individually.
>
> Most drivers prefer the system default, but the CXL driver wants to
> plumb a preferred policy through the dax kmem driver.
>
> Refactor APIs to add a new interface which allows the dax kmem module
> to select a preferred policy.
>
> Overriding the configured auto-online policy is only safe for known
> in-tree modules, where we know the override reflects a different,
> user-requested policy.  We do not want arbitrary out-of-tree drivers
> silently overriding the system-wide onlining policy, so restrict the
> new interface to the kmem module using EXPORT_SYMBOL_FOR_MODULES()
> rather than a plain EXPORT_SYMBOL_GPL().  Other in-tree modules (e.g.
> cxl_core) can be added to the allowed list as the need arises.
>
> Refactor add_memory_driver_managed, extract __add_memory_driver_managed
> - Add proper kernel-doc for add_memory_driver_managed while refactoring
> - New helper accepts an explicit online_type.
> - New helper validates online_type is between OFFLINE and ONLINE_MOVABLE
>
> Refactor: add_memory_resource, extract __add_memory_resource
> - new helper accepts an explicit online_type
>
> Original APIs now explicitly pass the system-default to new helpers.
>
> No functional change for existing users.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   include/linux/memory_hotplug.h |  3 ++
>   mm/memory_hotplug.c            | 61 +++++++++++++++++++++++++++++-----
>   2 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f059025f8f8b..d3edeb80aadb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -294,6 +294,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>   extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>   extern int add_memory_resource(int nid, struct resource *resource,
>   			       mhp_t mhp_flags);
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> +				const char *resource_name, mhp_t mhp_flags,
> +				enum mmop online_type);
>   extern int add_memory_driver_managed(int nid, u64 start, u64 size,
>   				     const char *resource_name,
>   				     mhp_t mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 494257054095..a66346def504 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1494,10 +1494,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
>    *
>    * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
>    */
> -int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
> +				 enum mmop online_type)
>   {
>   	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> -	enum mmop online_type = mhp_get_default_online_type();
>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>   	struct memory_group *group = NULL;
>   	u64 start, size;
> @@ -1585,7 +1585,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   		merge_system_ram_resource(res);
>   
>   	/* online pages if requested */
> -	if (mhp_get_default_online_type() != MMOP_OFFLINE)
> +	if (online_type != MMOP_OFFLINE)
>   		walk_memory_blocks(start, size, &online_type,
>   				   online_memory_block);
>   
> @@ -1603,7 +1603,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   	return ret;
>   }
>   
> -/* requires device_hotplug_lock, see add_memory_resource() */
> +int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +{
> +	return __add_memory_resource(nid, res, mhp_flags,
> +				     mhp_get_default_online_type());
> +}
> +
> +/* requires device_hotplug_lock, see __add_memory_resource() */
>   int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
>   {
>   	struct resource *res;
> @@ -1631,7 +1637,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
>   }
>   EXPORT_SYMBOL_GPL(add_memory);
>   
> -/*
> +/**
> + * __add_memory_driver_managed - add driver-managed memory with explicit online_type
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + * @online_type: Auto-Online behavior (offline, online, kernel, movable)
> + *
>    * Add special, driver-managed memory to the system as system RAM. Such
>    * memory is not exposed via the raw firmware-provided memmap as system
>    * RAM, instead, it is detected and added by a driver - during cold boot,
> @@ -1639,6 +1653,7 @@ EXPORT_SYMBOL_GPL(add_memory);
>    *
>    * Reasons why this memory should not be used for the initial memmap of a
>    * kexec kernel or for placing kexec images:
> + *
>    * - The booting kernel is in charge of determining how this memory will be
>    *   used (e.g., use persistent memory as system RAM)
>    * - Coordination with a hypervisor is required before this memory
> @@ -1651,9 +1666,12 @@ EXPORT_SYMBOL_GPL(add_memory);
>    *
>    * The resource_name (visible via /proc/iomem) has to have the format
>    * "System RAM ($DRIVER)".
> + *
> + * Return: 0 on success, negative error code on failure.
>    */
> -int add_memory_driver_managed(int nid, u64 start, u64 size,
> -			      const char *resource_name, mhp_t mhp_flags)
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> +		const char *resource_name, mhp_t mhp_flags,
> +		enum mmop online_type)
>   {
>   	struct resource *res;
>   	int rc;
> @@ -1663,6 +1681,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   	    resource_name[strlen(resource_name) - 1] != ')')
>   		return -EINVAL;
>   
> +	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
> +		return -EINVAL;
> +
>   	lock_device_hotplug();
>   
>   	res = register_memory_resource(start, size, resource_name);
> @@ -1671,7 +1692,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   		goto out_unlock;
>   	}
>   
> -	rc = add_memory_resource(nid, res, mhp_flags);
> +	rc = __add_memory_resource(nid, res, mhp_flags, online_type);
>   	if (rc < 0)
>   		release_memory_resource(res);
>   
> @@ -1679,6 +1700,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   	unlock_device_hotplug();
>   	return rc;
>   }
> +EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem");
> +
> +/**
> + * add_memory_driver_managed - add driver-managed memory
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + *
> + * Add driver-managed memory with the system default online type set by
> + * build config or kernel boot parameter.
> + *
> + * See __add_memory_driver_managed for more details.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int add_memory_driver_managed(int nid, u64 start, u64 size,
> +			      const char *resource_name, mhp_t mhp_flags)
> +{
> +	return __add_memory_driver_managed(nid, start, size, resource_name,
> +			mhp_flags,
> +			mhp_get_default_online_type());
> +}
>   EXPORT_SYMBOL_GPL(add_memory_driver_managed);
>   
>   /*

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>




^ permalink raw reply

* Re: [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Alexei Starovoitov @ 2026-06-24 16:30 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

On Wed Jun 24, 2026 at 6:11 AM PDT, Harry Yoo (Oracle) wrote:
>
> Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
> reported by Sashiko.

... and in fixes for sashiko complains sashiko finds more issues.
I don't think it will ever end. I suggest to fix realistic scenarios
instead of one out of billion cases that sashiko think is plausible
but will never be hit in reality. The chance of server crashing
due to cosmic rays are higher than such bugs. Hence do not fix them.

> To BPF folks: do we need to backport kmalloc_nolock() support
> for architectures without __CMPXCHG_DOUBLE to v6.18?

nope.

> There are still few users in v6.18, but I can't tell whether it is
> necessary to backport it to v6.18 (hopefully not as urgent as other
> bugfixes).

imo none of these 'fixes' are necessary. Humans are not hitting them.



^ permalink raw reply

* Re: [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg
From: Gupta, Pankaj @ 2026-06-24 16:28 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-3-gourry@gourry.net>


> Modify online_memory_block() to accept the online type through its arg
> parameter rather than calling mhp_get_default_online_type() internally.
>
> This prepares for allowing callers to specify explicit online types.
>
> Update the caller in add_memory_resource() to pass the default online
> type via a local variable.
>
> No functional change.
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   mm/memory_hotplug.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7ac19fab2263..6833208cc17c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,7 +1337,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>   
>   static int online_memory_block(struct memory_block *mem, void *arg)
>   {
> -	mem->online_type = mhp_get_default_online_type();
> +	enum mmop *online_type = arg;
> +
> +	mem->online_type = *online_type;
>   	return device_online(&mem->dev);
>   }
>   
> @@ -1494,6 +1496,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
>   int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   {
>   	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	enum mmop online_type = mhp_get_default_online_type();
>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>   	struct memory_group *group = NULL;
>   	u64 start, size;
> @@ -1582,7 +1585,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   
>   	/* online pages if requested */
>   	if (mhp_get_default_online_type() != MMOP_OFFLINE)
> -		walk_memory_blocks(start, size, NULL, online_memory_block);
> +		walk_memory_blocks(start, size, &online_type,
> +				   online_memory_block);
>   
>   	return ret;
>   error:
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>



^ permalink raw reply

* Re: [PATCH v2 10/13] mm: Remove __alloc_pages_node()
From: Suren Baghdasaryan @ 2026-06-24 16:24 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <20260622-alloc-trylock-v2-10-31f31367d420@google.com>

On Mon, Jun 22, 2026 at 3:01 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> There were only a few users, which have been removed. The only advantage
> of this API over alloc_pages_node() is avoiding a single conditional
> branch. The disadvantages are:
>
> 1. More API surface, more sources of confusion, more maintenance.
>
> 2. Worse impact of CPU hotplug bugs: most users of __alloc_pages_node()
>    were using the result of cpu_to_node(); if the CPU gets hotplugged
>    out this will return NUMA_NO_NODE. If one of these paths fails to
>    protect against a concurrent hotplug then page_alloc.c will use
>    NUMA_NO_NODE as an index into NODE_DATA() and cause some horrible
>    memory corruption or other. With alloc_pages_node(), the code might
>    just work fine.
>
> Ulterior motive: this frees up the __* variants of the allocator APIs to
> serve specifically for use as mm-internal API.

Ah, that's what motivated all that churn! :)

>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/gfp.h | 20 ++++----------------
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index cdf95a9f0b87c..7edcc2e0be9ce 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -278,21 +278,6 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
>         dump_stack();
>  }
>
> -/*
> - * Allocate pages, preferring the node given as nid. The node must be valid and
> - * online. For more general interface, see alloc_pages_node().
> - */
> -static inline struct page *
> -__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
> -{
> -       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -       warn_if_node_offline(nid, gfp_mask);
> -
> -       return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
> -}
> -
> -#define  __alloc_pages_node(...)               alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
> -
>  static inline
>  struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
>  {
> @@ -315,7 +300,10 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
>         if (nid == NUMA_NO_NODE)
>                 nid = numa_mem_id();
>
> -       return __alloc_pages_node_noprof(nid, gfp_mask, order);
> +       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> +       warn_if_node_offline(nid, gfp_mask);
> +
> +       return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
>  }
>
>  #define  alloc_pages_node(...)                 alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
>
> --
> 2.54.0
>


^ permalink raw reply

* Re: [PATCH v2 03/13] mm/page_alloc: unify __alloc_frozen_pages[_nolock]_noprof()
From: Brendan Jackman @ 2026-06-24 16:24 UTC (permalink / raw)
  To: Suren Baghdasaryan, Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <CAJuCfpE28TqZy2k5-X1ZEdd0HhTuc5i7+0kyxX5nXH4j+5JVfw@mail.gmail.com>

On Wed Jun 24, 2026 at 4:00 PM UTC, Suren Baghdasaryan wrote:
> On Mon, Jun 22, 2026 at 3:02 AM Brendan Jackman <jackmanb@google.com> wrote:
>>
>> Currently the core allocator code is controlled by ALLOC_NOLOCK, but the
>> main entry point function is significantly different from the normal
>> __alloc_frozen_pages_nolock(), this is tiring when reading the code.
>>
>> Plumb the ALLOC_NOLOCK control one layer up in the call stack: create
>> an alloc_flags argument to __alloc_frozen_pages_nolock() (which is only
>> exposed to mm/) and then turn the nolock variant into a thin wrapper
>> that just sets that flag (as well as handling NUMA_NO_NODE, similar to
>> how some of the wrappers in gfp.h do).
>>
>> Rationale that this doesn't change anything:
>>
>> 1. Simple bits: A bunch of the nolock-specific handling is just moved to
>>    the new alloc_order_allowed(), alloc_trylock_allowed() and
>>    gfp_trylock.
>>
>> 2. __alloc_frozen_pages_noprof() has some extra logic that wasn't
>>    previously in the nolock variant:
>>
>>    a. Application of gfp_allowed_mask; this only affects early boot, and
>>       only flags that affect the slowpath get changed here.
>>
>>    b. Application of current_gfp_context() - also only affects the
>>       slowpath
>>
>> 3. The slowpath itself: this is now just explicitly skipped under
>>    !ALLOC_TRYLOCK.
>>
>> Ulterior motive: adding an alloc_flags arg to the allocator's
>> mm-internal entrypoint can later be used to do more allocation
>> customisation without needing to create new GFP flags.
>
> Looks like a nice overall cleanup.
>
>>
>> While adding this flag to a bunch of places, create ALLOC_DEFAULT to
>> avoid a mysterious literal 0 in most places. alloc_frozen_pages_noprof()
>> is defined above the alloc flags so just leave that as a slightly messy
>> exception instead of trying to fully reorder mm/internal.h for that one
>> case.
>
> Moving the whole alloc_frozen_pages() block down seems simple enough
> and would avoid special-casing this.

Yeah... when you put it like that, I don't actually know why I was so
intimidated by the prospect of moving a handful of function
declarations!

Anyway in the v3 I'm creating a new mm/page_alloc.h so this will happen
as a side effect of that.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox