Linux-mm Archive on lore.kernel.org
* Re: [PATCH v8 42/46] KVM: selftests: Provide common function to set memory attributes
From: Fuad Tabba @ 2026-06-25  9:09 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-42-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce vm_mem_set_memory_attributes(), which handles setting of memory
> attributes for a range of guest physical addresses, regardless of whether
> the attributes should be set via guest_memfd or via the memory attributes
> at the VM level.
>
> Refactor existing vm_mem_set_{shared,private} functions to use the new
> function. Opportunistically update the size parameter to use size_t instead
> of u64.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  tools/testing/selftests/kvm/include/kvm_util.h | 46 +++++++++++++++++++-------
>  1 file changed, 34 insertions(+), 12 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 3a6b1fa7f26ef..db1442da21bb1 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -454,18 +454,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
>         vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
>  }
>
> -static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
> -                                     u64 size)
> -{
> -       vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> -}
> -
> -static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
> -                                    u64 size)
> -{
> -       vm_set_memory_attributes(vm, gpa, size, 0);
> -}
> -
>  static inline int __gmem_set_memory_attributes(int fd, u64 offset,
>                                                size_t size, u64 attributes,
>                                                u64 *error_offset)
> @@ -532,6 +520,40 @@ static inline void gmem_set_shared(int fd, u64 offset, size_t size)
>         gmem_set_memory_attributes(fd, offset, size, 0);
>  }
>
> +static inline void vm_mem_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
> +                                               size_t size, u64 attrs)
> +{
> +       if (kvm_has_gmem_attributes) {
> +               gpa_t end = gpa + size;
> +               off_t fd_offset;
> +               gpa_t addr;
> +               size_t len;
> +               int fd;
> +
> +               for (addr = gpa; addr < end; addr += len) {
> +                       fd = kvm_gpa_to_guest_memfd(vm, addr, &fd_offset, &len);
> +                       len = min(end - addr, len);
> +
> +                       gmem_set_memory_attributes(fd, fd_offset, len, attrs);
> +               }
> +       } else {
> +               vm_set_memory_attributes(vm, gpa, size, attrs);
> +       }
> +}
> +
> +static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
> +                                     size_t size)
> +{
> +       vm_mem_set_memory_attributes(vm, gpa, size,
> +                                    KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
> +                                    size_t size)
> +{
> +       vm_mem_set_memory_attributes(vm, gpa, size, 0);
> +}
> +
>  void vm_guest_mem_fallocate(struct kvm_vm *vm, gpa_t gpa, u64 size,
>                             bool punch_hole);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
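For readers skimming the diff: the guest_memfd branch of vm_mem_set_memory_attributes() walks the GPA range one region at a time and clamps each chunk to what is left of the request. Below is a minimal standalone sketch of that chunking pattern. It assumes a hypothetical fixed table of regions; `demo_regions`, `demo_lookup()` and `demo_walk()` are illustrative names, not selftest APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memslot backed by guest_memfd. */
struct demo_region {
	uint64_t gpa_base;	/* first guest physical address covered */
	uint64_t size;		/* number of bytes covered */
};

static const struct demo_region demo_regions[] = {
	{ 0x0000, 0x3000 },
	{ 0x3000, 0x1000 },
	{ 0x4000, 0x4000 },
};

/* Return the bytes remaining in the region containing @gpa, or 0 if none. */
static uint64_t demo_lookup(uint64_t gpa)
{
	size_t i;

	for (i = 0; i < sizeof(demo_regions) / sizeof(demo_regions[0]); i++) {
		const struct demo_region *r = &demo_regions[i];

		if (gpa >= r->gpa_base && gpa - r->gpa_base < r->size)
			return r->size - (gpa - r->gpa_base);
	}
	return 0;
}

/*
 * Walk [gpa, gpa + size) one region at a time. Returns the number of
 * chunks processed and stores the total byte count in @bytes_done.
 */
static int demo_walk(uint64_t gpa, uint64_t size, uint64_t *bytes_done)
{
	uint64_t end = gpa + size, addr, len;
	int chunks = 0;

	*bytes_done = 0;
	for (addr = gpa; addr < end; addr += len) {
		len = demo_lookup(addr);
		assert(len != 0);		/* every address must be backed */
		if (len > end - addr)
			len = end - addr;	/* clamp the last chunk to the request */
		*bytes_done += len;		/* a real caller would set attributes here */
		chunks++;
	}
	return chunks;
}
```

A request that starts in the middle of the first region and ends inside the third is split into three chunks, with only the last one clamped.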



* Re: [PATCH v1 1/2] eventfd: luo: luo support for preserving eventfd
From: Pratyush Yadav @ 2026-06-25  9:06 UTC (permalink / raw)
  To: Chenghao Duan
  Cc: viro, brauner, jack, linux-fsdevel, pasha.tatashin, linux-kernel,
	rppt, pratyush, kexec, linux-mm, jianghaoran
In-Reply-To: <20260625054946.73445-2-duanchenghao@kylinos.cn>

On Thu, Jun 25 2026, Chenghao Duan wrote:

> This patch adds support for preserving eventfd file descriptors across
> kexec live updates using the Live Update Orchestrator (LUO) framework.
> Userspace applications using eventfd for event notification can now
> maintain their state across kernel updates.
>
> Preserved State:
> The following properties of the eventfd are preserved across kexec:
> - Counter Value: The current 64-bit counter value, including any pending
>   events that have been signaled but not yet consumed by readers.
> - File Flags: The creation flags (EFD_SEMAPHORE, EFD_CLOEXEC, EFD_NONBLOCK)
>   are preserved.
>
> Non-Preserved State:
> - File Descriptor Number: The eventfd will be assigned a new fd number
>   in the target process after restore.
> - Wait Queue State: Any processes blocked on read() operations will be
>   woken up and need to re-establish their blocking state.
> - All other internal state is reset to default.
>
> Changes:
> - fs/eventfd.c: Add eventfd_luo_get_state() to safely read eventfd state
>   (count and flags), and eventfd_create() helper function.
> - fs/eventfd_luo.c: New file implementing LUO file operations:
>   preserve, freeze, unpreserve, retrieve, and finish callbacks.
> - include/linux/eventfd.h: Export new functions.
> - include/linux/kho/abi/eventfd.h: Define the ABI contract with
>   eventfd_luo_ser structure for serialization.

Why do you need to preserve this? Why don't you create a fresh one after
kexec? You just preserve the counter, which looks pretty much useless.
You can just as well open a new eventfd after kexec and set the counter
value if you care about it.

[...]

-- 
Regards,
Pratyush Yadav



* Re: [PATCH 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
From: Michal Koutný @ 2026-06-25  9:04 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team
In-Reply-To: <20260606114158.3126210-2-usama.arif@linux.dev>


Hello Usama.

On Sat, Jun 06, 2026 at 04:41:33AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	/*
> -	 * The in-kernel users only care about the reclaim efficiency
> -	 * for this @memcg rather than the whole subtree, and there
> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
> +	 * Only two combinations have a consumer:
> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
> +	 * Skip the other two: nothing consumes the result.

This is a good finding; I had some trouble convincing myself that v2
really has only memcg->socket_pressure. I think swapping the
order of the patches would make it easier to comprehend.


Michal



* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-25  9:03 UTC (permalink / raw)
  To: val
  Cc: akpm, axboe, brauner, david, dhowells, fuse-devel, hch, jack,
	joannelkoong, linux-api, linux-fsdevel, linux-kernel, linux-mm,
	miklos, netdev, patches, pfalcato, rostedt, safinaskar, torvalds,
	viro, willy
In-Reply-To: <83f05c55-efba-4bf5-abfe-d2ab0819e904@packett.cool>

Val Packett <val@packett.cool>:
> speaking of fuse_dev_splice……_write actually, this series has broken 
> xdg-document-portal!
> 
> https://github.com/flatpak/xdg-desktop-portal/issues/2026
> 
> Specifically what happens is that the EINVAL is returned due to oh.len 
> != nbytes:
> 
> fuse_dev_do_write: oh.len 16400 != nbytes 15526
> 
> (where 16400 == 16384 (read len) + 16, 15526 == 15510 (file len) + 16)
> 
> After reverting the series, there is no error because oh.len 
> becomes 15526 too.

Please test v2 of my fixes:
https://lore.kernel.org/lkml/20260625083409.3769242-1-safinaskar@gmail.com/ .

This should fix this bug.
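
The mismatch reported above can be restated as a tiny check. This is only a sketch of the comparison described in the quoted log line, not the actual fuse code: `demo_check_reply()` and `DEMO_OUT_HDR_SIZE` are made-up names, and the 16-byte header size is taken from the arithmetic in the quoted report.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OUT_HDR_SIZE 16u	/* header bytes preceding the payload (assumed) */

/*
 * The reply header announces a total length (@oh_len); the write must
 * deliver exactly that many bytes (@nbytes), header included.
 */
static int demo_check_reply(uint32_t oh_len, size_t nbytes)
{
	assert(nbytes >= DEMO_OUT_HDR_SIZE);
	return oh_len == nbytes ? 0 : -EINVAL;
}
```

With the numbers from the report, a header claiming 16384 + 16 bytes against a write of 15510 + 16 bytes is rejected, while matching lengths pass.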

-- 
Askar Safin



* Re: [PATCH 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
From: Michal Koutný @ 2026-06-25  9:00 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team
In-Reply-To: <20260606114158.3126210-3-usama.arif@linux.dev>


On Sat, Jun 06, 2026 at 04:41:34AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
> 
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
> 
> Move it all into a new mm/vmpressure-v1.c built only when
> CONFIG_MEMCG_V1=y (following the existing memcontrol-v1.o pattern).

Thanks for this dissection.

> @@ -283,14 +152,8 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	if (tree) {
> -		spin_lock(&vmpr->sr_lock);
> -		scanned = vmpr->tree_scanned += scanned;
> -		vmpr->tree_reclaimed += reclaimed;
> -		spin_unlock(&vmpr->sr_lock);
> -
> -		if (scanned < vmpressure_win)
> -			return;
> -		schedule_work(&vmpr->work);
> +		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
> +		return;
>  	} else {
>  		enum vmpressure_levels level;
>  

This return; looks weird, I'd either
a) drop it or 
b) keep it + de-indent the rest of the vmpressure().
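
To make the two options concrete, here is a reduced sketch of both shapes. It is not the real vmpressure() body: `demo_option_a()` and `demo_option_b()` are hypothetical, the counter stands in for the vmpressure_v1_account_tree() call, and the returned value stands in for the level computed in the else branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Option a): keep the if/else and drop the early return. */
static int demo_option_a(bool tree, int *tree_calls)
{
	int level = -1;

	assert(tree_calls);
	if (tree) {
		(*tree_calls)++;	/* stands in for the v1 tree accounting */
	} else {
		level = 1;		/* stands in for the in-kernel level logic */
	}
	return level;
}

/* Option b): keep the early return and de-indent the rest. */
static int demo_option_b(bool tree, int *tree_calls)
{
	assert(tree_calls);
	if (tree) {
		(*tree_calls)++;
		return -1;
	}

	return 1;			/* formerly the else branch, one level shallower */
}
```

Both variants behave identically; the difference is only whether the non-tree path stays nested under an else.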



* Re: [PATCH] tools/writeback: parse help before importing drgn
From: Yousef Alhouseen @ 2026-06-25  8:59 UTC (permalink / raw)
  To: SeongJae Park
  Cc: willy, jack, shikemeng, linux-fsdevel, linux-mm, linux-kernel
In-Reply-To: <20260625002752.96325-1-sj@kernel.org>

Hi SJ,

You're right; the normal invocation path is through the drgn launcher,
so the no-drgn case I described is too narrow to justify the patch as
written.

Please drop this patch.

Thanks,
Yousef


On Wed, 24 Jun 2026 17:27:52 -0700, SeongJae Park <sj@kernel.org> wrote:
> On Wed, 24 Jun 2026 14:35:14 +0200 Yousef Alhouseen <alhouseenyousef@gmail.com> wrote:
>
> > wb_monitor.py imports drgn before argparse can handle "-h". That makes
> > help fail on systems where drgn is not installed, even though the script
> > does not need drgn to print usage text.
>
> But... how do you execute the drgn script on systems that don't have drgn? I tried
> to mimic the situation and reproduce the issue you describe, but this is what I
> get:
>
> $ sudo mv /usr/bin/drgn /usr/bin/drgn.bak
> $ drgn tools/writeback/wb_monitor.py
> -bash: /usr/bin/drgn: No such file or directory
> $ python tools/writeback/wb_monitor.py
> Traceback (most recent call last):
> File "/home/lkhack/linux/tools/writeback/wb_monitor.py", line 44, in <module>
> bdi_list = prog['bdi_list']
> ^^^^
> NameError: name 'prog' is not defined
>
> Thanks,
> SJ
>
> [...]



* Re: [PATCH v8 41/46] KVM: selftests: Provide function to look up guest_memfd details from gpa
From: Fuad Tabba @ 2026-06-25  8:58 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-41-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce a new helper, kvm_gpa_to_guest_memfd(), to find the
> guest_memfd-related details of a memory region that contains a given guest
> physical address (GPA).
>
> The function returns the file descriptor for the memfd, the offset into
> the file that corresponds to the GPA, and the number of bytes remaining
> in the region from that GPA.
>
> kvm_gpa_to_guest_memfd() was factored out from vm_guest_mem_fallocate();
> refactor vm_guest_mem_fallocate() to use the new helper.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  tools/testing/selftests/kvm/include/kvm_util.h |  3 +++
>  tools/testing/selftests/kvm/lib/kvm_util.c     | 37 ++++++++++++++++----------
>  2 files changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 79ab64ac8b869..3a6b1fa7f26ef 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -428,6 +428,9 @@ static inline void vm_enable_cap(struct kvm_vm *vm, u32 cap, u64 arg0)
>         vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
>  }
>
> +int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
> +                          size_t *nr_bytes);
> +
>  /*
>   * KVM_SET_MEMORY_ATTRIBUTES{,2} overwrites _all_ attributes.  These
>   * flows need significant enhancements to support multiple attributes.
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 524ef97d634bf..0b2256ea65ff9 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1305,27 +1305,20 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, u64 base, u64 size,
>                             bool punch_hole)
>  {
>         const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
> -       struct userspace_mem_region *region;
>         u64 end = base + size;
> -       gpa_t gpa, len;
>         off_t fd_offset;
> -       int ret;
> +       int fd, ret;
> +       size_t len;
> +       gpa_t gpa;
>
>         for (gpa = base; gpa < end; gpa += len) {
> -               u64 offset;
> -
> -               region = userspace_mem_region_find(vm, gpa, gpa);
> -               TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
> -                           "Private memory region not found for GPA 0x%lx", gpa);
> +               fd = kvm_gpa_to_guest_memfd(vm, gpa, &fd_offset, &len);
> +               len = min(end - gpa, len);
>
> -               offset = gpa - region->region.guest_phys_addr;
> -               fd_offset = region->region.guest_memfd_offset + offset;
> -               len = min_t(u64, end - gpa, region->region.memory_size - offset);
> -
> -               ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
> +               ret = fallocate(fd, mode, fd_offset, len);
>                 TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx",
>                             punch_hole ? "punch hole" : "allocate", gpa, len,
> -                           region->region.guest_memfd, mode, fd_offset);
> +                           fd, mode, fd_offset);
>         }
>  }
>
> @@ -1662,6 +1655,22 @@ void *addr_gpa2alias(struct kvm_vm *vm, gpa_t gpa)
>         return (void *) ((uintptr_t) region->host_alias + offset);
>  }
>
> +int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
> +                          size_t *nr_bytes)
> +{
> +       struct userspace_mem_region *region;
> +       gpa_t gpa_offset;
> +
> +       region = userspace_mem_region_find(vm, gpa, gpa);
> +       TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
> +                   "guest_memfd memory region not found for GPA 0x%lx", gpa);
> +
> +       gpa_offset = gpa - region->region.guest_phys_addr;
> +       *fd_offset = region->region.guest_memfd_offset + gpa_offset;
> +       *nr_bytes = region->region.memory_size - gpa_offset;
> +       return region->region.guest_memfd;
> +}
> +
>  /* Create an interrupt controller chip for the specified VM. */
>  void vm_create_irqchip(struct kvm_vm *vm)
>  {
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
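The arithmetic in kvm_gpa_to_guest_memfd() is small enough to restate on its own. The sketch below assumes a single, already-located slot; `struct demo_slot` and `demo_gpa_to_fd()` are hypothetical, with field names borrowed from the region fields used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of one guest_memfd-backed memory region. */
struct demo_slot {
	uint64_t guest_phys_addr;	/* GPA where the slot starts */
	uint64_t memory_size;		/* slot length in bytes */
	uint64_t guest_memfd_offset;	/* file offset where the slot starts */
	int guest_memfd;		/* backing file descriptor */
};

/*
 * Translate @gpa into a file offset and the number of bytes left in the
 * slot from that address. Returns the backing fd.
 */
static int demo_gpa_to_fd(const struct demo_slot *slot, uint64_t gpa,
			  uint64_t *fd_offset, uint64_t *nr_bytes)
{
	uint64_t gpa_offset;

	assert(gpa >= slot->guest_phys_addr);
	gpa_offset = gpa - slot->guest_phys_addr;
	assert(gpa_offset < slot->memory_size);

	*fd_offset = slot->guest_memfd_offset + gpa_offset;
	*nr_bytes = slot->memory_size - gpa_offset;
	return slot->guest_memfd;
}
```

For a 2 MiB slot at GPA 0x100000 bound at file offset 0x4000, GPA 0x101000 maps to file offset 0x5000 with 0x1ff000 bytes remaining.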



* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-25  8:53 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-xUrLQYGiRJZc4Boi+KX=0TJSWymErNovANVko20fMDVA@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
> > Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> > named fifos? Or this is merely a test?
> 
> Yes, it does.

I.e., CRIU relies on that named fifo behavior? Okay, I just sent
v2 of my fixes. The patchset contains a fix for named fifos.

Please test whether it fixes the named fifo problem.

> I already explained that this isn't just a performance degradation, it
> actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
> our parasite code within the context of a user process, where execution
> speed is critical. A heavy performance penalty completely invalidates
> the pre-dump logic, making the feature useless.

This is very unfortunate. But I still want to remove vmsplice.

> At a minimum, we may need to consider a deprecation plan where vmsplice
> with SPLICE_F_GIFT triggers a warning for a few releases before these
> changes are applied. Alternatively, we could introduce the proposed
> behavior alongside a sysctl to fall back to the old behavior and explicitly
> state that this fallback path will be completely deprecated in a future kernel
> version.

My patches change not only SPLICE_F_GIFT behavior, but also vmsplice
behavior in general.

Let other developers decide what to do (i. e. do nothing, remove
vmsplice now or implement some deprecation scheme).

-- 
Askar Safin



* Re: [PATCH] mm/memory: refactor finish_fault
From: David Hildenbrand (Arm) @ 2026-06-25  8:52 UTC (permalink / raw)
  To: Sarthak Sharma, Andrew Morton
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Dev Jain, linux-mm,
	linux-kernel
In-Reply-To: <20260624102047.144543-1-sarthak.sharma@arm.com>

On 6/24/26 12:20, Sarthak Sharma wrote:
> finish_fault() currently has a goto fallback implementation
> where we try to map a large folio with PTEs. If that cannot be
> installed, we goto fallback and go through the fallback mapping
> path again. This looks weird and is tough to comprehend.
> 
> Remove the goto fallback implementation and try to map the
> whole folio if allowed. If the whole folio cannot be mapped,
> fall back to single page mapping without repeating the whole
> function.
> 
> The cleanup of finish_fault() was suggested by David in [1].
> 
> [1] https://lore.kernel.org/all/3684c55a-6581-4731-b94a-19526f455a1e@kernel.org/
> 
> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Sarthak Sharma <sarthak.sharma@arm.com>
> ---
> Tested this patch by running mm selftests on baseline and patched 7.1
> kernels. No regressions were observed.

This goes in the right direction, but I think we can do better.

For example, we know that we always have to fall back to a single PTE with
userfaultfd (incl. not mapping a PMD-sized folio by PMDs).

Let me find some time to play with this myself.

-- 
Cheers,

David



* Re: [PATCH 0/2] fs: support $ORIGIN in ELF interpreter paths
From: Christian Brauner @ 2026-06-25  8:50 UTC (permalink / raw)
  To: John Ericson
  Cc: Farid Zakaria, Jan Kara, Kees Cook, Christian Brauner, Al Viro,
	shuah, linux-fsdevel, linux-mm, linux-kselftest, LKML
In-Reply-To: <24420045-a6eb-4999-ab19-1e344eaba8a4@app.fastmail.com>

On 2026-06-22 17:08:55-04:00, John Ericson wrote:
> Hi, I am another Nix developer, and have participated in some LKML
> discussions in the (recent and distant) past, and thought I should weigh
> in here too.
> 
> On Mon, Jun 22, 2026, at 1:15 PM, Farid Zakaria wrote:
> 
> > On Mon, Jun 22, 2026 at 3:40 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Having put forward the patch, I'm clearly biased toward thinking this
> > support should exist in the kernel.
> > If I had to strengthen my argument, it would be that the kernel
> > should not be imposing on userland how the interpreter is found.
> > Finding the interpreter relative to the binary would be useful for
> > package deployment scenarios similar to app-bundles beyond systems
> > like Nix -- which is the originating reason why $ORIGIN exists in the
> > dynamic linker.
> 
> Yes, the idea of making "relocatable software" is not a new one, and
> indeed it is why `$ORIGIN` is supported in the RPATH etc. in the first
> place.
> 
> Most of the programming model for writing relocatable software is fixed
> at this point. For example, /proc/self/exe made it much easier to look
> up arbitrary stuff relevant to the current executable. It is just some
> initial entry point stuff (the ELF interpreter, and shebangs) which is a
> glaring exception. Those should support `$ORIGIN` too. There is no good
> technical justification (that I can think of) for some but not all of
> these supporting `$ORIGIN` --- either it makes sense everywhere, or it
> makes sense nowhere.
> 
> (I suspect the only reason it didn't happen was pure inertia/Conway's
> law --- easier for whoever was excited about `$ORIGIN` to change the
> glibc loader than the kernel.)
> 
> > To me, the gap is that prior to systems like Nix, the idea of wanting
> > your dynamic linker to be part of your app bundle was not necessary
> > but Nix models the dependency chain down to the loader. Such
> > functionality would be even more correct for these other bundled
> > solutions as well, making them portable across userspace glibc
> > versions for instance.
> 
> Yes, exactly. Traditionally people thought "eh `/lib/ld-linux.so.*`
> doesn't change too much", and decided relocatable software that
> nonetheless hard-coded that absolute path to an unknown system-provided
> ELF interpreter was good enough. (Or if they weren't good enough, they
> went with static linking, but that imposes other costs.)
> 
> Now there do exist purely-user-space work-arounds, like
> https://github.com/Mic92/wrap-buddy, but they are quite complex, and
> involve various patching trickery that is likely to scare a lot of
> security analysis tools. A kernel-based solution that allows clean
> declarative expression of intent with `$ORIGIN` is much more elegant.
> 
> > > In particular the
> 
> I think it is good to see what Conda does as documented in
> <https://docs.conda.io/projects/conda-build/en/stable/resources/make-relocatable.html>
> and consider why relying on namespaces vs good old-fashioned relocatable
> isn't good enough for them either.
> 
> (I don't doubt that Conda would find this approach more robust than
> their sedding tricks, and prefer to use it where possible.)
> 
> The short answer is while all of us in the build system space love
> sandboxing during the build, we don't want that to lead to *requiring*
> run time sandboxing of the built artifacts. For example, we can
> certainly arrange sandboxing so `/lib/ld-linux.so.*` is the one that
> some executable expects now, but every time that executable is run, it
> *must* be run in a root filesystem where `/lib/ld-linux.so.*` is the
> loader it expects.
> 
> If you have multiple programs that (for whatever reasons) expect
> multiple different loaders, all spawning one another, it would
> potentially incur quite the development cost to ensure that they all do
> the proper unsharing to make everything work.
> 
> Relocatability recognizes that whether or not namespaces exist, in an
> "open world" scenario where we don't know how the software we are
> writing will be combined with other software for deployment downstream
> in different ways, it is easiest to adopt an idiom where different
> things can be placed at different absolute paths, at the user's
> discretion, and so conflicts are always avoidable.
> 
> > > Anyway I'm pretty sure Christian will have more educated answer than me but
> 
> Waiting makes sense, I am curious too what he will have to say.

The arguments I have heard from various people so far are:

(1) Userspace would be able to clone a random chroot to /woot and run a
    binary from it without having to set up a complicated sandbox
    effectively making dynamically linked binaries more like static
    binaries in a sense.

(2) Quote:
    "If you debootstrap/dnf a chroot to some location in your
    home dir and try to run a binary from it, that it tries to load the
    libraries from your /usr is a pretty unintuitive and not at all
    useful behavior."

(3) Quote:
    "[Various remote execution things run in locked down containers that
    disable userns, which makes the sandbox impossible and hence our
    builds wouldn't work there."

I'm discounting "Oh, userspace already allows this so why not the
kernel.". I think that's generally a bad argument. Kernel and userspace
aren't really alike in that regard.

The userspace ORIGIN concept is guarded behind AT_SECURE. The kernel has
to enforce the same rule. That means the loader now depends on the type
of binary. I think this is a rather serious issue.

First, it creates confusion in userspace about which loader is used. Second,
it means that any build/chroot that uses AT_SECURE binaries now has to use
the sandboxing solution anyway, or risk that some binaries use the system
loader and others the chroot loader.

Ignoring AT_SECURE, LSMs will likely need a say in whether that ORIGIN
thing gets honored or not, introducing yet another vector where this can
be overridden or ignored.

Also, we change long-standing kernel behavior which will be very
surprising for any userspace that might implicitly rely on the fact that
the system loader is used. So even if we were to do something like this
it would very likely have to be configurable in some way.

This makes this all ripe for malicious loader injection attacks. And we
need to consider this possibility.

So I'm not enthusiastic about this. I want this to be consistent.
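
The AT_SECURE concern can be illustrated with a simplified userspace model. This is not the proposed kernel code and not how any real loader is implemented: `demo_expand_origin()` is a hypothetical helper, and refusing outright for secure executions is just one possible policy (falling back to a system path would be another).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand a leading "$ORIGIN" in @interp using @exe_dir.
 * Returns 0 on success, -1 if expansion is refused or @out is too small.
 */
static int demo_expand_origin(const char *interp, const char *exe_dir,
			      bool secure, char *out, size_t outlen)
{
	static const char token[] = "$ORIGIN";
	const size_t tlen = sizeof(token) - 1;
	bool has_origin;
	int n;

	assert(interp && exe_dir && out);

	/* Only treat "$ORIGIN" or "$ORIGIN/..." as the token. */
	has_origin = strncmp(interp, token, tlen) == 0 &&
		     (interp[tlen] == '/' || interp[tlen] == '\0');

	if (!has_origin) {
		n = snprintf(out, outlen, "%s", interp);	/* path used as-is */
	} else {
		if (secure)
			return -1;	/* secure execution: refuse to expand */
		n = snprintf(out, outlen, "%s%s", exe_dir, interp + tlen);
	}
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The same interpreter string resolves for an ordinary binary and is refused for a secure one, which is the "loader depends on the type of binary" inconsistency described above.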




* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  8:50 UTC (permalink / raw)
  To: brauner
  Cc: alex.aring, allen.lkml, arnd, chuck.lever, david, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt, viro
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

On Thu, 25 Jun 2026 09:28:08 +0200 Christian Brauner <brauner@kernel.org> wrote:

> > +	coredump_pre_exit=
> > +			[KNL] Change the default value for
> > +			/proc/<pid>/coredump_pre_exit.
> > +			See also Documentation/filesystems/proc.rst.
> 
> Nah, we're not doing a separate file for this. That makes no sense
> whatsoever. I've already explained this in the first mail. There are
> effectively three modes:
> 
> (1) dump to a file
> (2) spawn a super-privileged usermode helper process and connect the
>     coredumping process and said helper via a pipe
> (3) coredumping process connects to AF_UNIX socket
> 
> Parameterize (1) and (2) via command line arguments. I strongly
> suspect you're using some AI tooling so it should be able to figure out
> how this was done in the past.
> 
> (3) can be extended by just introducing a new flag value for struct
>     coredump_req. That is also illustrated by previous work.
> 
> We're not spreading procfs files. It's terrible api design especially
> for security sensitive changes.

The coredump socket approach is easier to implement because it allows for
interaction between the server and client, enabling the customization of
protocols. However, for the coredump file method, I can only think of
defining "r" and "R" through core_pattern to release flock and file-backed
shared data in advance. I'm unsure if this is feasible, as it changes the
original definition of core_pattern.

Regarding the coredump pipe, there is also a lack of a mechanism for the
pipe program to notify the coredump process, so it might still require
adding "r" and "R" at the end of core_pattern to indicate this, allowing
the coredump process to handle the early release on its own. I'm not sure
if my understanding is correct.

Even if the coredump pipe program obtains the file pointer from the process
that generated the coredump, it cannot reduce the reference count of the
file (which I understand is a very bad attempt). Since it cannot decrease
the reference count of the file, the early release must still be performed
by the task that generated the coredump. Given this situation, it seems
that we indeed need to use core_pattern for marking. I've thought for a
long time about more suitable solutions, but I haven't come up with any.


> > +#ifndef O_TMPCLOS
> > +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> > +#endif
> 
> Sorry, not going to happen. This does not justify the addition of a
> new uapi value at all.

OK, if I end up using it, I will not put it in a user-visible header file.

> > +
> > +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

I'll get rid of the attempt to add a new boot-up argument for this feature.

> [Severity: High]
> Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
> iteration invalidate the outer iterator? The loop traverses the maple tree
> using the iterator vmi. However, do_munmap() creates its own internal
> VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
> iterator is not updated to reflect these structural changes, its cached
> state becomes stale, which can lead to a use-after-free when vma_next()
> is subsequently called.
> 
> via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

When executing this traversal logic, we have already acquired a lock, and
the process has been frozen. The traversal logic goes from start to finish.
Are you sure that this approach could still have issues?

> [Severity: High]
> Is it safe to iterate the file descriptor table without holding
> rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
> kills other threads, concurrent threads can still trigger expand_files(),
> which replaces the fdt and frees the old one after an RCU grace period.

Since the process has already been frozen, do we still need to consider
such concurrency issues?

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
> 
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

Currently, this flag will only be used by the logic we added, so I believe
there won't be any issues.

Thanks
Xin Zhao




* Re: [PATCH v8 40/46] KVM: selftests: Reset shared memory after hole-punching
From: Fuad Tabba @ 2026-06-25  8:46 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-40-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> private_mem_conversions_test used to reset the shared memory that was used
> for the test to an initial pattern at the end of each test iteration. Then,
> it would punch out the pages, which would zero memory.
>
> Without in-place conversion, the resetting would write shared memory, and
> hole-punching will zero private memory, hence resetting the test to the
> state at the beginning of the for loop.
>
> With in-place conversion, resetting writes memory as shared, and
> hole-punching zeroes the same physical memory, hence undoing the reset
> done before the hole punch.
>
> Move the resetting after the hole-punching, and reset the entire
> PER_CPU_DATA_SIZE instead of just the tested range.
>
> With in-place conversion, this zeroes and then resets the same physical
> memory. Without in-place conversion, the private memory is zeroed, and the
> shared memory is reset to init_p.
>
> This is sufficient since at each test stage, the memory is assumed to start
> as shared, and private memory is always assumed to start zeroed. Conversion
> zeroes memory, so the future test stages will work as expected.
>
> Fixes: 43f623f350ce1 ("KVM: selftests: Add x86-only selftest for private memory conversions")
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
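
The ordering argument in the commit message can be modelled in plain
userspace C. This is only a sketch under stated assumptions: a single
buffer stands in for the physical memory that in-place conversion shares
between the shared and private views, punch_hole() is a hypothetical
stand-in for the guest's PUNCH_HOLE request, and DATA_SIZE and INIT_P are
illustrative values rather than the selftest's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SIZE 64	/* illustrative; stands in for PER_CPU_DATA_SIZE */
#define INIT_P    0xcc	/* illustrative initial pattern */

/* Hypothetical model of hole-punching: the backing memory reads as zero afterwards. */
static void punch_hole(unsigned char *backing, size_t size)
{
	assert(backing != NULL);
	memset(backing, 0, size);
}

/* Old order: reset, then punch. Returns the first byte the next iteration sees. */
static unsigned char reset_then_punch(unsigned char *mem)
{
	memset(mem, INIT_P, DATA_SIZE);
	punch_hole(mem, DATA_SIZE);	/* same memory, so the reset is undone */
	return mem[0];
}

/* New order: punch, then reset. Returns the first byte the next iteration sees. */
static unsigned char punch_then_reset(unsigned char *mem)
{
	punch_hole(mem, DATA_SIZE);
	memset(mem, INIT_P, DATA_SIZE);	/* the pattern survives for the next stage */
	return mem[0];
}
```

In this model only punch_then_reset() leaves the initial pattern in place,
which matches the reasoning given for moving the memset() after the hole
punch.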

> ---
>  tools/testing/selftests/kvm/x86/private_mem_conversions_test.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> index 861baff201e78..289ad10063fca 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> @@ -202,15 +202,18 @@ static void guest_test_explicit_conversion(u64 base_gpa, bool do_fallocate)
>                 guest_sync_shared(gpa, size, p3, p4);
>                 memcmp_g(gpa, p4, size);
>
> -               /* Reset the shared memory back to the initial pattern. */
> -               memset((void *)gpa, init_p, size);
> -
>                 /*
>                  * Free (via PUNCH_HOLE) *all* private memory so that the next
>                  * iteration starts from a clean slate, e.g. with respect to
>                  * whether or not there are pages/folios in guest_mem.
>                  */
>                 guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
> +
> +               /*
> +                * Hole-punching above zeroed private memory. Reset shared
> +                * memory in preparation for the next GUEST_STAGE.
> +                */
> +               memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
>         }
>  }
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [PATCH 5/6] mm: remove unnecessary empty range check in early_calculate_totalpages()
From: Mike Rapoport @ 2026-06-25  8:46 UTC (permalink / raw)
  To: Sang-Heon Jeon; +Cc: Mike Rapoport, Andrew Morton, linux-mm
In-Reply-To: <20260621145919.1453-6-ekffu200098@gmail.com>

On Sun, 21 Jun 2026 23:59:15 +0900, Sang-Heon Jeon <ekffu200098@gmail.com> wrote:
> early_calculate_totalpages() iterates the memory ranges with
> for_each_mem_pfn_range() and calls node_set_state(nid, N_MEMORY) only when
> end_pfn - start_pfn is non-zero. for_each_mem_pfn_range() never returns an
> empty range, so start_pfn < end_pfn always.
> 
> Therefore the check is unnecessary, so remove it.
> 
> [...]

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.




* Re: [PATCH 0/6] treewide: remove unnecessary invalid range checks in memblock iteration loops
From: Mike Rapoport @ 2026-06-25  8:46 UTC (permalink / raw)
  To: Sang-Heon Jeon
  Cc: Albert Ou, Andrew Morton, Andrey Ryabinin, Catalin Marinas,
	Huacai Chen, Mike Rapoport, Muchun Song, Oscar Salvador,
	Palmer Dabbelt, Paul Walmsley, Will Deacon, Alexander Potapenko,
	Alexandre Ghiti, Andrey Konovalov, David Hildenbrand,
	Dmitry Vyukov, kasan-dev, linux-arm-kernel, linux-mm, linux-riscv,
	loongarch, Vincenzo Frascino, WANG Xuerui
In-Reply-To: <20260621145919.1453-1-ekffu200098@gmail.com>

On Sun, 21 Jun 2026 23:59:10 +0900, Sang-Heon Jeon <ekffu200098@gmail.com> wrote:
> treewide: remove unnecessary invalid range checks in memblock iteration loops
> 
> The memblock API guarantees that for_each_mem_range() and
> for_each_mem_pfn_range() never return an invalid range, meaning start is
> always less than end.
> 
> Several memblock callers still have unnecessary invalid range checks in
> their loop bodies, so remove them.
> 
> [...]

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.




* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: David Hildenbrand (Arm) @ 2026-06-25  8:46 UTC (permalink / raw)
  To: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, Pedro Falcato, Miklos Szeredi,
	Andy Lutomirski, Collin Funk, David Laight, Stefan Metzmacher,
	The 8472, Willy Tarreau, Joanne Koong, Val Packett, Andrei Vagin,
	patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

On 6/25/26 10:34, Askar Safin wrote:
> This patchset is for VFS. Of course, it depends on my previous vmsplice
> patchset ( https://lore.kernel.org/all/20260531010107.1953702-1-safinaskar@gmail.com/ ).
> 
> I fix some problems in my previous patchset.

I think we concluded that we cannot rip out vmsplice that way at this point, and
I suspect that Christian will drop that topic branch from -next after -rc1.

-- 
Cheers,

David



* Re: [PATCH] mm/rmap: use huge_ptep_get() in try_to_unmap_one()
From: Dev Jain @ 2026-06-25  8:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, ljs
  Cc: riel, liam, vbabka, harry, jannh, kas, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual, stable
In-Reply-To: <237cdb97-5abc-4c89-a0cf-1a961425f947@kernel.org>



On 25/06/26 1:58 pm, David Hildenbrand (Arm) wrote:
> On 6/25/26 10:03, Dev Jain wrote:
>>
>>
>> On 25/06/26 1:26 pm, David Hildenbrand (Arm) wrote:
>>> On 6/25/26 06:28, Dev Jain wrote:
>>>> try_to_unmap_one() handles hugetlb folios when memory failure needs
>>>> to replace a poisoned hugetlb mapping with a hwpoison entry. In that
>>>> case page_vma_mapped_walk() returns the hugetlb entry in pvmw.pte, but
>>>> the code reads it with ptep_get() before decoding the PFN.
>>>>
>>>> That is wrong on architectures where hugetlb entries are not encoded as
>>>> regular PTEs. On s390, for example, a raw huge RSTE must be converted
>>>> by huge_ptep_get() before helpers such as pte_pfn() can inspect it. A
>>>> raw decode can select the wrong subpage, so try_to_unmap_one() can
>>>> install a hwpoison entry for the wrong PFN.
>>>>
>>>> The userspace-visible result is that a later access to the poisoned
>>>> hugetlb subpage can miss the expected SIGBUS. With DEBUG_VM, the wrong
>>>> subpage can also trip the PageHWPoison check.
>>>>
>>>> Use huge_ptep_get() for hugetlb mappings before decoding the PFN.
>>>>
>>>> Before c7ab0d2fdc84, the bug existed in the form of a plain dereference:
>>>> we would check the head page pfn of the hugetlb with pte_pfn(*pte), and
>>>> bail out on mismatch. This would mean that the hwpoisoned entry will not
>>>> get installed.
>>>>
>>>> I am not sure what the procedure is for such very old bugs - how far
>>>> back should I really go?
>>>>
>>>> Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> Applies on mm-unstable (d17fe8a046a2).
>>>> There are similar old bugs present, in try_to_migrate_one(), check_pte(),
>>>> remove_migration_pte(), prot_none_hugetlb_entry().
>>>
>>> Yeah, we should handle all these cases properly. Can you send fixes?
>>>
>>> Using ptep_get() on something that's not a PTE entry is shaky on some architectures.
>>
>> I can send the fixes blaming the oldest commit to which a backport is relatively simple. The bug will
>> still remain before that, where we don't even do ptep_get(), just a plain dereference, if
>> that is fine. Probably no one is running pre-2017 kernels.
> 
> The issue is that we would have to analyze in which cases exactly it would cause
> problems, like when migrating prot-none hugetlb folios on s390x, where
> pte_present() would not work as expected.
> 
> I don't think any of us has time (or motivation) for that detailed analysis to
> make some odd hugetlb cases happy.
> 
> So I'd say, let's just fix it in a simple way and be done with it. Use
> best-effort Fixes: but rather state in the patch description that this was found
> by code inspection and that the actual effects are unclear (e.g., pte_present()
> misbehaving on s390x), and using huge_ptep_get() is just the right thing to do.

Sure thing, sounds good.





* [PATCH v2 7/7] pipe: set FMODE_NOWAIT for named FIFOs
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

CRIU relies on the ability to do vmsplice(SPLICE_F_NONBLOCK) on named FIFOs.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index c0ccf21b9..a8e9b4459 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1156,6 +1156,12 @@ static int fifo_open(struct inode *inode, struct file *filp)
 	/* We can only do regular read/write on fifos */
 	stream_open(inode, filp);
 
+	/*
+	 * CRIU relies on ability to do vmsplice(SPLICE_F_NONBLOCK)
+	 * on named FIFOs.
+	 */
+	filp->f_mode |= FMODE_NOWAIT;
+
 	switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
 	case FMODE_READ:
 	/*
-- 
2.47.3




* Re: [RFC v2 PATCH] reserve_mem: add support for static memory
From: Mike Rapoport @ 2026-06-25  8:37 UTC (permalink / raw)
  To: Shyam Saini
  Cc: linux-mm, linux-doc, linux-kernel, akpm, tgopinath, bboscaccy,
	kees, tony.luck, gpiccoli, bp, rdunlap, peterz, feng.tang,
	dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers,
	Catalin Marinas, Will Deacon, Ard Biesheuvel, David Hildenbrand,
	linux-arm-kernel
In-Reply-To: <ajyC2eX9MKSU84Z8@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

Hi Shyam,

On Wed, Jun 24, 2026 at 06:22:33PM -0700, Shyam Saini wrote:
> On 21 Jun 2026 13:36, Mike Rapoport wrote:
> > On Thu, Jun 18, 2026 at 11:23:31PM -0700, Shyam Saini wrote:
> > > reserve_mem relies on dynamic memory allocation, which limits use
> > > cases where memory must be preserved across boots.
> > > E.g. ramoops memory reservation on ACPI platforms
> > >
> > > So add support to pass a pre-determined static address and reserve
> > > memory at a specified location. This enables use case like ramoops
> > > on ACPI platforms to reliably access ramoops region with previous
> > > boot logs.
> > > 
> > > Also skip the parsing of <align> when static address is passed.
> > > 
> > > Example syntax for static address
> > >  reserve_mem=4M@0x1E0000000:oops
> > 
> > reserve_mem is best effort by design because such hacks as well as memmap=
> > cannot guarantee this memory is actually free.
> > 
> > If you want to preserve ramoops reliably, use KHO with reserve_mem.
> > The first kernel will allocate memory, this memory will be preserved by KHO
> > and could be picked up by the second kernel.
> 
> ok, On ARM64 DTS systems, we can reserve ramoops memory in the device tree during
> the warm reboot.

The cc list actually implied x86 ;-)
Added arm64 folks now.

> For an equivalent ARM64 ACPI platform, what is the recommended way to reserve
> and preserve that memory across the boots? 

I don't think it exists, but a command line option (be it memmap= or
reserve_mem=) does not seem the right way to me.

Most of the arguments that were made against adding memmap= to arm64 [1]
apply here.

If kexec is an option, KHO provides a reliable way to preserve memory
across boots.

If kexec is not an option, we should look for a generic way to specify
something like DT's reserved_mem for ACPI/EFI systems.

[1] https://lkml.kernel.org/lkml/20201118063314.22940-1-song.bao.hua@hisilicon.com/T/

> Thanks,
> Shyam

-- 
Sincerely yours,
Mike.



* [PATCH v2 6/7] vmsplice: return -EINVAL for particular combination of flags
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

See comment for details.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index dbd0debc2..b1f71b142 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1258,6 +1258,16 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
 		ssize_t ret;
 
+		/*
+		 * libfuse relies on sharing vmsplice behavior.
+		 * So we detect particular combination of flags to
+		 * pipe2(2) and vmsplice(2) and return -EINVAL.
+		 * This forces libfuse to fall back to non-vmsplice
+		 * code path.
+		 */
+		if ((flags == SPLICE_F_NONBLOCK) && (fd_file(f)->f_flags & O_NONBLOCK))
+			return -EINVAL;
+
 		do {
 			pipe_lock(pipe);
 			ret = pipe_wait_for_space(pipe, non_block);
-- 
2.47.3
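
From the userspace side, the fallback that this check is meant to trigger
might look roughly like the sketch below. It is an illustration only, not
libfuse's actual code: send_to_pipe() and demo_roundtrip() are hypothetical
helpers, and treating ENOSYS the same as EINVAL is a defensive assumption
that the patch does not mention.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try vmsplice() first, and fall back to a plain
 * write() if the kernel rejects the call with EINVAL (or ENOSYS).
 */
static ssize_t send_to_pipe(int wfd, const void *buf, size_t len,
			    int *used_fallback)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(used_fallback != NULL);
	*used_fallback = 0;

	ret = vmsplice(wfd, &iov, 1, SPLICE_F_NONBLOCK);
	if (ret < 0 && (errno == EINVAL || errno == ENOSYS)) {
		*used_fallback = 1;
		ret = write(wfd, buf, len);
	}
	return ret;
}

/*
 * Round-trip a short message through an O_NONBLOCK pipe, i.e. the
 * pipe2(2) + vmsplice(2) flag combination the check looks for.
 * Returns 0 on success, -1 otherwise.
 */
static int demo_roundtrip(void)
{
	static const char msg[] = "hello";
	char out[sizeof(msg)] = { 0 };
	int fds[2], fallback, ok;
	ssize_t n;

	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;

	n = send_to_pipe(fds[1], msg, sizeof(msg), &fallback);
	ok = n == (ssize_t)sizeof(msg) &&
	     read(fds[0], out, sizeof(out)) == (ssize_t)sizeof(msg) &&
	     memcmp(out, msg, sizeof(msg)) == 0;

	close(fds[0]);
	close(fds[1]);
	return ok ? 0 : -1;
}
```

Whichever path is taken, the reader sees the same bytes, so a caller
written this way should keep working on kernels with and without the
-EINVAL check.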




* [PATCH v2 5/7] vmsplice: make sure we don't wait after writing some data
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

Make sure we don't wait for space in the pipe after writing some data.
This is needed for compatibility with the previous version of vmsplice.
Found by LTP vmsplice01.
See comments in the code and links below for details.

Link: https://lore.kernel.org/all/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner/
Link: https://lore.kernel.org/all/CAHk-=wgV-j-G3d+899Zm1pQ=NaJrddPz=GKcL5Yw5DTUM=GaUw@mail.gmail.com/
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 77487b307..dbd0debc2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1221,6 +1221,8 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
+	struct pipe_inode_info *pipe;
+
 	if (unlikely(flags & ~SPLICE_F_ALL))
 		return -EINVAL;
 
@@ -1229,11 +1231,44 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		return -EBADF;
 
 	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
-	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+	pipe = get_pipe_info(fd_file(f), /* for_splice = */ false);
+
+	if (!pipe)
 		return -EBADF;
 
 	if (fd_file(f)->f_mode & FMODE_WRITE) {
-		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		/*
+		 * When writing to the pipe, previous implementation of vmsplice
+		 * first waited for space in the pipe to appear
+		 * (depending on whether SPLICE_F_NONBLOCK was passed),
+		 * then did unconditional non-blocking write to the pipe.
+		 *
+		 * This differs from what pwritev2 does.
+		 *
+		 * For compatibility we do the same thing previous
+		 * implementation did.
+		 *
+		 * We lock the pipe, do pipe_wait_for_space, then unlock
+		 * the pipe, and then do vfs_writev. vfs_writev internally
+		 * locks the pipe again. This may cause TOCTOU: when we
+		 * do vfs_writev, the pipe may become full again. So we
+		 * do a loop.
+		 */
+
+		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
+		ssize_t ret;
+
+		do {
+			pipe_lock(pipe);
+			ret = pipe_wait_for_space(pipe, non_block);
+			pipe_unlock(pipe);
+
+			if (ret < 0)
+				break;
+
+			ret = vfs_writev(fd_file(f), vec, vlen, NULL, RWF_NOWAIT);
+		} while (!non_block && ret == -EAGAIN);
+
 		if (ret > 0)
 			add_wchar(current, ret);
 		inc_syscw(current);
-- 
2.47.3
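
The wait-then-write loop in the comment above has a rough userspace
analogue, sketched here with poll(2) and a non-blocking write(2). This is
an illustration under stated assumptions, not the kernel code path:
wait_then_write() and demo_full_pipe() are hypothetical helpers, and the
file descriptor is assumed to have been opened with O_NONBLOCK so that the
write itself never sleeps.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical analogue of the patched write path: first wait for space
 * (or only check for it when non_block is set), then do a write that
 * never blocks. If the space vanished between the two steps, the write
 * fails with EAGAIN and the blocking variant simply waits again.
 */
static ssize_t wait_then_write(int wfd, const void *buf, size_t len,
			       int non_block)
{
	struct pollfd pfd = { .fd = wfd, .events = POLLOUT };
	ssize_t ret;

	assert(buf != NULL);
	do {
		int n = poll(&pfd, 1, non_block ? 0 : -1);

		if (n < 0)
			return -1;
		if (n == 0) {
			/* No space and the caller asked not to wait. */
			errno = EAGAIN;
			return -1;
		}
		ret = write(wfd, buf, len);
	} while (!non_block && ret < 0 && errno == EAGAIN);

	return ret;
}

/* Fill a pipe, check the non-blocking failure, drain a chunk, write again. */
static int demo_full_pipe(void)
{
	char chunk[4096] = { 0 };
	int fds[2], full_ok, drained;
	ssize_t r;

	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;

	/* Keep writing until the non-blocking path reports no space. */
	while ((r = wait_then_write(fds[1], chunk, sizeof(chunk), 1)) > 0)
		;
	full_ok = (r < 0 && errno == EAGAIN);

	/* Free one chunk; the waiting variant should now succeed. */
	drained = read(fds[0], chunk, sizeof(chunk)) == (ssize_t)sizeof(chunk);
	r = wait_then_write(fds[1], chunk, sizeof(chunk), 0);

	close(fds[0]);
	close(fds[1]);
	return (full_ok && drained && r == (ssize_t)sizeof(chunk)) ? 0 : -1;
}
```

As in the patch, the check and the write are two separate steps, so the
retry on EAGAIN is what covers the window between them.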




* [PATCH v2 4/7] pipe: move wait_for_space to fs/pipe.c and rename it
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

This is needed, because I plan to use it in fs/read_write.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c                 | 17 +++++++++++++++++
 fs/splice.c               | 19 +------------------
 include/linux/pipe_fs_i.h |  2 ++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9..c0ccf21b9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1451,6 +1451,23 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	return ret;
 }
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block)
+{
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (!pipe_is_full(pipe))
+			return 0;
+		if (non_block)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe_wait_writable(pipe);
+	}
+}
+
 static const struct super_operations pipefs_ops = {
 	.destroy_inode = free_inode_nonrcu,
 	.statfs = simple_statfs,
diff --git a/fs/splice.c b/fs/splice.c
index 707db2c2c..d12243d19 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,23 +1239,6 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
-{
-	for (;;) {
-		if (unlikely(!pipe->readers)) {
-			send_sig(SIGPIPE, current, 0);
-			return -EPIPE;
-		}
-		if (!pipe_is_full(pipe))
-			return 0;
-		if (non_block)
-			return -EAGAIN;
-		if (signal_pending(current))
-			return -ERESTARTSYS;
-		pipe_wait_writable(pipe);
-	}
-}
-
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1268,7 +1251,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
+	ret = pipe_wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index a1eeed800..be653625d 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -335,4 +335,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned int size);
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block);
+
 #endif
-- 
2.47.3




* [PATCH v2 3/7] splice: turn wait_for_space flags argument into bool
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

I want to do this, because I will move this function to fs/pipe.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6ddf7dd72..707db2c2c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,7 +1239,7 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
 {
 	for (;;) {
 		if (unlikely(!pipe->readers)) {
@@ -1248,7 +1248,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		}
 		if (!pipe_is_full(pipe))
 			return 0;
-		if (flags & SPLICE_F_NONBLOCK)
+		if (non_block)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
@@ -1268,7 +1268,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags);
+	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
-- 
2.47.3




* [PATCH v2 2/7] vmsplice: change argument type back to "int"
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

My previous vmsplice patchset changed vmsplice argument from
"int" to "unsigned long". This may cause problems, so let's
change it back.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c          | 2 +-
 include/linux/syscalls.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e224e7cb8..77487b307 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1218,7 +1218,7 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 /*
  * Legacy preadv2/pwritev2 wrapper.
  */
-SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
 	if (unlikely(flags & ~SPLICE_F_ALL))
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a86a88207..46a3ec954 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,7 +514,7 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+asmlinkage long sys_vmsplice(int fd, const struct iovec __user *vec,
 			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
-- 
2.47.3




* [PATCH v2 1/7] vmsplice: open-code do_writev and do_readv
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

My previous vmsplice patch made the following mistake: I did
"CLASS(fd, f)(fd)", then did some checks on the resulting "struct file",
then passed the numeric (!) file descriptor to a function.

This is somewhat okay in this particular case, but I still think
this is a code smell, so I fix it by open-coding do_writev and do_readv.

Also I insert a comment to warn other developers to keep
do_writev and do_readv in sync with vmsplice(2).

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1e5444f4d..e224e7cb8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1070,6 +1070,7 @@ static ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 			unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1093,6 +1094,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 			 unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1226,14 +1228,24 @@ SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
 	if (fd_empty(f))
 		return -EBADF;
 
-	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
 	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
 		return -EBADF;
 
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
-	else
-		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	if (fd_file(f)->f_mode & FMODE_WRITE) {
+		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		if (ret > 0)
+			add_wchar(current, ret);
+		inc_syscw(current);
+		return ret;
+	} else {
+		ssize_t ret = vfs_readv(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+
+		if (ret > 0)
+			add_rchar(current, ret);
+		inc_syscr(current);
+		return ret;
+	}
 }
 
 /*
-- 
2.47.3




* [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches

This patchset is for VFS. Of course, it depends on my previous vmsplice
patchset ( https://lore.kernel.org/all/20260531010107.1953702-1-safinaskar@gmail.com/ ).

I fix some problems in my previous patchset.

1. Fix a problem with CLASS(fd, f)(fd). See the first patch in this
patchset for details. This is probably not very important, but I fix it
anyway.

2. Change "unsigned long" back to "int". See the second patch for details.
Again, this is probably not important, but I want to fix it anyway.

3. Fix that LTP vmsplice01 bug.

4. libfuse relies on the sharing behavior of vmsplice. So we detect a
particular combination of flags to pipe2(2) and vmsplice(2) and return
-EINVAL. This forces libfuse to fall back to its non-vmsplice code path,
i.e. we fix the libfuse-related regression [1].
I ran a Debian code search for the regex "vmsplice.*SPLICE_F_NONBLOCK" and
found no other packages using this particular combination of flags
except for fuse itself. (Okay, fio and stress-ng also match,
but these are merely testers.) So I think it is okay to return
EINVAL here; breakage will be minimal.

5. Set FMODE_NOWAIT for named FIFOs. CRIU relies on the ability to do
vmsplice(SPLICE_F_NONBLOCK) on named FIFOs, so this fixes one CRIU-related
regression [2]. There is another CRIU-related regression which I do not
fix [3]: CRIU in splice mode becomes so slow that splice mode is
useless. I personally still believe that removing vmsplice is
the right thing to do. Another option is to do nothing. Yet another option
is to implement some deprecation period [3]. Let other developers
decide.
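
For illustration only, here is a rough userspace sketch of the kind of
fallback item 4 relies on. It is not taken from libfuse; pipe_send() is
a hypothetical helper, and treating ENOSYS like EINVAL is my own
assumption for systems where vmsplice(2) is not available at all. The
idea is simply: try vmsplice(2) with SPLICE_F_NONBLOCK, and if the
kernel rejects it, send the same iovec with writev(2).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not libfuse code): try a non-blocking
 * vmsplice(2) first. If the kernel rejects the request with EINVAL
 * (or ENOSYS, assumed here for systems without vmsplice), fall back
 * to an ordinary copying writev(2) on the same pipe.
 */
static ssize_t pipe_send(int pipe_wr, const struct iovec *iov, size_t nr_segs)
{
	ssize_t ret = vmsplice(pipe_wr, iov, nr_segs, SPLICE_F_NONBLOCK);

	if (ret < 0 && (errno == EINVAL || errno == ENOSYS))
		ret = writev(pipe_wr, iov, (int)nr_segs);
	return ret;
}
```

A caller passes the write end of a pipe and an iovec array; any other
error (for example EAGAIN on a full pipe) is returned unchanged, so
the caller still has to handle it.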

See patches for details.

Please run the LTP vmsplice01 test again.

Notes:

- I want to repeat: I change the behavior around SPLICE_F_NONBLOCK.
Previously, vmsplice ignored whether the pipe itself was opened as
a non-blocking file. Now this is taken into account, and in my opinion
the new behavior is better.
- vmsplice(2) is now in fs/read_write.c. It is now very similar to
preadv2 and pwritev2, so I think it belongs there.
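
To make the first note concrete: a caller that creates its pipe with
O_NONBLOCK but wants blocking semantics from vmsplice(2) should be
prepared to see EAGAIN even without SPLICE_F_NONBLOCK. Below is a
minimal sketch of one way to cope with that, assuming the caller is
happy to sleep in poll(2). vmsplice_wait() is a hypothetical name, not
something from this patchset.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: behaves like a blocking vmsplice(2) even when
 * the pipe was created with O_NONBLOCK, by waiting for POLLOUT
 * whenever the kernel reports EAGAIN.
 */
static ssize_t vmsplice_wait(int pipe_wr, const struct iovec *iov, size_t nr_segs)
{
	for (;;) {
		struct pollfd pfd = { .fd = pipe_wr, .events = POLLOUT };
		ssize_t ret = vmsplice(pipe_wr, iov, nr_segs, 0);

		/* Success, or an error other than "pipe is full". */
		if (ret >= 0 || errno != EAGAIN)
			return ret;
		/* Pipe is full: sleep until the reader makes room. */
		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
			return -1;
	}
}
```

This loop works the same way whether or not the kernel honors
O_NONBLOCK for vmsplice, so it does not depend on which behavior is in
effect.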

Please review this patchset carefully. I'm still a new contributor.
In particular, please review the do-while loop; I'm not sure I did
everything right.

Tested in QEMU.

[1] https://lore.kernel.org/all/CAJnrk1Y9egYizkx1H9K0cqxSYuB+7vLvQbV7Tf4C5eHFqnnC-A@mail.gmail.com/
[2] https://lore.kernel.org/all/CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com/
[3] https://lore.kernel.org/all/CANaxB-xUrLQYGiRJZc4Boi+KX=0TJSWymErNovANVko20fMDVA@mail.gmail.com/

v1: https://lore.kernel.org/lkml/20260606061031.3744880-1-safinaskar@gmail.com/

Changes since v1: fix fuse-related and CRIU-related regressions (see above).

Askar Safin (7):
  vmsplice: open-code do_writev and do_readv
  vmsplice: change argument type back to "int"
  splice: turn wait_for_space flags argument into bool
  pipe: move wait_for_space to fs/pipe.c and rename it
  vmsplice: make sure we don't wait after writing some data
  vmsplice: return -EINVAL for particular combination of flags
  pipe: set FMODE_NOWAIT for named FIFOs

 fs/pipe.c                 | 23 +++++++++++++
 fs/read_write.c           | 71 +++++++++++++++++++++++++++++++++++----
 fs/splice.c               | 19 +----------
 include/linux/pipe_fs_i.h |  2 ++
 include/linux/syscalls.h  |  2 +-
 5 files changed, 91 insertions(+), 26 deletions(-)


base-commit: 8d86fcfc2857d64af85f5c87c193c25655c970af
-- 
2.47.3


