Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH RFC 4/4] arm64/io: Add {__raw_read|__raw_write}128 support
From: Mark Rutland @ 2025-11-12 12:28 UTC (permalink / raw)
  To: Chenghai Huang
  Cc: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api, fanghao11,
	shenyang39, liulongfang, qianweili
In-Reply-To: <20251112015846.1842207-5-huangchenghai2@huawei.com>

On Wed, Nov 12, 2025 at 09:58:46AM +0800, Chenghai Huang wrote:
> From: Weili Qian <qianweili@huawei.com>
> 
> Starting from ARMv8.4, stp and ldp instructions become atomic.

That's not true for accesses to Device memory types.

Per ARM DDI 0487, L.b, section B2.2.1.1 ("Changes to single-copy atomicity in
Armv8.4"):

  If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load
  or store two 64-bit registers are single-copy atomic when all of the
  following conditions are true:
  • The overall memory access is aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

IIUC when used for Device memory types, those can be split, and a part
of the access could be replayed multiple times (e.g. due to an
intetrupt).

I don't think we can add this generally. It is not atomic, and not
generally safe.

Mark.

> Currently, device drivers depend on 128-bit atomic memory IO access,
> but these are implemented within the drivers. Therefore, this introduces
> generic {__raw_read|__raw_write}128 function for 128-bit memory access.
> 
> Signed-off-by: Weili Qian <qianweili@huawei.com>
> Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
> ---
>  arch/arm64/include/asm/io.h | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 83e03abbb2ca..80430750a28c 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -50,6 +50,17 @@ static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
>  	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
>  }
>  
> +#define __raw_write128 __raw_write128
> +static __always_inline void __raw_write128(u128 val, volatile void __iomem *addr)
> +{
> +	u64 low, high;
> +
> +	low = val;
> +	high = (u64)(val >> 64);
> +
> +	asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(low), "rZ"(high), "r"(addr));
> +}
> +
>  #define __raw_readb __raw_readb
>  static __always_inline u8 __raw_readb(const volatile void __iomem *addr)
>  {
> @@ -95,6 +106,16 @@ static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
>  	return val;
>  }
>  
> +#define __raw_read128 __raw_read128
> +static __always_inline u128 __raw_read128(const volatile void __iomem *addr)
> +{
> +	u64 high, low;
> +
> +	asm volatile("ldp %0, %1, [%2]" : "=r" (low), "=r" (high) : "r" (addr));
> +
> +	return (((u128)high << 64) | (u128)low);
> +}
> +
>  /* IO barriers */
>  #define __io_ar(v)							\
>  ({									\
> -- 
> 2.33.0
> 
> 

^ permalink raw reply

* Re: [PATCH] man/man2/clone.2: Document CLONE_NEWPID and CLONE_NEWUSER flag
From: Alejandro Colomar @ 2025-11-12 11:23 UTC (permalink / raw)
  To: hoodit dev; +Cc: Carlos O'Donell, linux-man, linux-api, Andrew Morton
In-Reply-To: <CAFvyz33t9gYOi2HtNFNC_YAPS-_0QHiqJQwatc7YsGppstiZ7A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2286 bytes --]

Hi,

On Wed, Oct 29, 2025 at 06:00:50PM +0900, hoodit dev wrote:
> Hi, Alejandro Colomar and Carlos
> 
> Just a friendly ping to check if you had a chance to review this patch.

I don't know enough of clone(2) to review this.  I'll wait for Carlos's
review.


Have a lovely day!
Alex

> 
> Thanks
> 
> 2025년 5월 2일 (금) 오전 6:30, Alejandro Colomar <alx@kernel.org>님이 작성:
> >
> > Hi Carlos,
> >
> > On Mon, Apr 21, 2025 at 04:16:03AM +0900, devhoodit wrote:
> > > CLONE_NEWPID and CLONE_PARENT can be used together, but not CLONE_THREAD.  Similarly, CLONE_NEWUSER and CLONE_PARENT can be used together, but not CLONE_THREAD.
> > > This was discussed here: <https://lore.kernel.org/linux-man/06febfb3-e2e2-4363-bc34-83a07692144f@redhat.com/T/>
> > > Relevant code: <https://github.com/torvalds/linux/blob/219d54332a09e8d8741c1e1982f5eae56099de85/kernel/fork.c#L1815>
> > >
> > > Cc: Carlos O'Donell <carlos@redhat.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: devhoodit <devhoodit@gmail.com>
> >
> > Could you please review this patch?
> >
> >
> > Have a lovely night!
> > Alex
> >
> > > ---
> > >  man/man2/clone.2 | 9 +++------
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/man/man2/clone.2 b/man/man2/clone.2
> > > index 1b74e4c92..b9561125a 100644
> > > --- a/man/man2/clone.2
> > > +++ b/man/man2/clone.2
> > > @@ -776,9 +776,7 @@ .SS The flags mask
> > >  no privileges are needed to create a user namespace.
> > >  .IP
> > >  This flag can't be specified in conjunction with
> > > -.B CLONE_THREAD
> > > -or
> > > -.BR CLONE_PARENT .
> > > +.BR CLONE_THREAD .
> > >  For security reasons,
> > >  .\" commit e66eded8309ebf679d3d3c1f5820d1f2ca332c71
> > >  .\" https://lwn.net/Articles/543273/
> > > @@ -1319,11 +1317,10 @@ .SH ERRORS
> > >  mask.
> > >  .TP
> > >  .B EINVAL
> > > +Both
> > >  .B CLONE_NEWPID
> > > -and one (or both) of
> > > +and
> > >  .B CLONE_THREAD
> > > -or
> > > -.B CLONE_PARENT
> > >  were specified in the
> > >  .I flags
> > >  mask.
> > > --
> > > 2.49.0
> > >
> >
> > --
> > <https://www.alejandro-colomar.es/>
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: RFC: Serial port DTR/RTS - O_NRESETDEV
From: Greg KH @ 2025-11-12 11:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Theodore Ts'o, Maarten Brock, linux-serial@vger.kernel.org,
	linux-api@vger.kernel.org, LKML
In-Reply-To: <D4AF3E24-8698-4EEC-9D52-655D69897111@zytor.com>

On Mon, Nov 10, 2025 at 07:57:22PM -0800, H. Peter Anvin wrote:
> Honestly, though, I'm far less interested in what 8250-based hardware does than e.g. USB.

hahahahahahaha {snort}

Hah.  that's a good one.

Oh, you aren't kidding.

Wow, good luck with this.  USB-serial adaptors are all over the place,
some have real uarts in them (and so do buffering in the device, and
line handling in odd ways when powered up), and some are almost just a
straight pipe through to the USB host with control line handling ideas
tacked on to the side as an afterthought, if at all.

There is no standard here, they all work differently, and even work
differently across the same device type with just barely enough hints
for us to determine what is going on.

So don't worry about USB, if you throw that into the mix, all bets are
off and you should NEVER rely on that.

Remeber USB->serial was explicitly rejected by the USB standard group,
only to have it come back in the "side door" through the spec process
when it turned out that Microsoft hated having to write a zillion
different vendor-specific drivers because the vendor provided ones kept
crashing user's machines.  So what we ended up with was "just enough" to
make it through the spec process, and even then line signals are
probably never tested so you can't rely on them.

good luck!

greg "this brought up too many bad memories" k-h

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Mike Rapoport @ 2025-11-12 10:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <CA+CK2bD3hps+atqUZ2LKyuoOSRRUWpTPE+frd5g13js4EAFK8g@mail.gmail.com>

On Tue, Nov 11, 2025 at 03:42:24PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 11, 2025 at 3:39 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > > >       kho_memory_init();
> > > >
> > > > +     /* Live Update should follow right after KHO is initialized */
> > > > +     liveupdate_init();
> > > > +
> > >
> > > Why do you think it should be immediately after kho_memory_init()?
> > > Any reason this can't be called from start_kernel() or even later as an
> > > early_initcall() or core_initall()?
> >
> > Unfortunately, no, even here it is too late, and we might need to find
> > a way to move the kho_init/liveupdate_init earlier. We must be able to
> > preserve HugeTLB pages, and those are reserved earlier in boot.
> 
> Just to clarify: liveupdate_init() is needed to start using:
> liveupdate_flb_incoming_* API, and FLB data is needed during HugeTLB
> reservation.

Since flb is "file-lifecycle-bound", it implies *file*. Early memory
reservations in hugetlb are not bound to files, they end up in file objects
way later.

So I think for now we can move liveupdate_init() later in boot and we will
solve the problem of hugetlb reservations when we add support for hugetlb.
 
> Pasha

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v6 00/17] vfs: recall-only directory delegations for knfsd
From: Christian Brauner @ 2025-11-12  9:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, linux-nfs,
	linux-cifs, samba-technical, netfs, ecryptfs, linux-unionfs,
	linux-xfs, netdev, linux-api, Miklos Szeredi, Alexander Viro,
	Jan Kara, Chuck Lever, Alexander Aring, Trond Myklebust,
	Anna Schumaker, Steve French, Paulo Alcantara, Ronnie Sahlberg,
	Shyam Prasad N, Tom Talpey, Bharath SM, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich, David Howells, Tyler Hicks,
	NeilBrown, Olga Kornievskaia, Dai Ngo, Amir Goldstein,
	Namjae Jeon, Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

On Tue, 11 Nov 2025 09:12:41 -0500, Jeff Layton wrote:
> Behold, another version of the directory delegation patchset. This
> version contains support for recall-only delegations. Support for
> CB_NOTIFY will be forthcoming (once the client-side patches have caught
> up).
> 
> The main changes here are in response to Jan's comments. I also changed
> struct delegation use to fixed-with integer types.
> 
> [...]

Applied to the vfs-6.19.directory.delegations branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.directory.delegations branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.directory.delegations

[01/17] filelock: make lease_alloc() take a flags argument
        https://git.kernel.org/vfs/vfs/c/6fc5f2b19e75
[02/17] filelock: rework the __break_lease API to use flags
        https://git.kernel.org/vfs/vfs/c/4be9f3cc582a
[03/17] filelock: add struct delegated_inode
        https://git.kernel.org/vfs/vfs/c/6976ed2dd0d5
[04/17] filelock: push the S_ISREG check down to ->setlease handlers
        https://git.kernel.org/vfs/vfs/c/e6d28ebc17eb
[05/17] vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
        https://git.kernel.org/vfs/vfs/c/b46ebf9a768d
[06/17] vfs: allow mkdir to wait for delegation break on parent
        https://git.kernel.org/vfs/vfs/c/e12d203b8c88
[07/17] vfs: allow rmdir to wait for delegation break on parent
        https://git.kernel.org/vfs/vfs/c/4fa76319cd0c
[08/17] vfs: break parent dir delegations in open(..., O_CREAT) codepath
        https://git.kernel.org/vfs/vfs/c/134796f43a5e
[09/17] vfs: clean up argument list for vfs_create()
        https://git.kernel.org/vfs/vfs/c/85bbffcad730
[10/17] vfs: make vfs_create break delegations on parent directory
        https://git.kernel.org/vfs/vfs/c/c826229c6a82
[11/17] vfs: make vfs_mknod break delegations on parent directory
        https://git.kernel.org/vfs/vfs/c/e8960c1b2ee9
[12/17] vfs: make vfs_symlink break delegations on parent dir
        https://git.kernel.org/vfs/vfs/c/92bf53577f01
[13/17] filelock: lift the ban on directory leases in generic_setlease
        https://git.kernel.org/vfs/vfs/c/d0eab9fc1047
[14/17] nfsd: allow filecache to hold S_IFDIR files
        https://git.kernel.org/vfs/vfs/c/544a0ee152f0
[15/17] nfsd: allow DELEGRETURN on directories
        https://git.kernel.org/vfs/vfs/c/80c8afddc8b1
[16/17] nfsd: wire up GET_DIR_DELEGATION handling
        https://git.kernel.org/vfs/vfs/c/8b99f6a8c116
[17/17] vfs: expose delegation support to userland
        https://git.kernel.org/vfs/vfs/c/1602bad16d7d

^ permalink raw reply

* [PATCH RFC 0/4] Introduce 128-bit IO access
From: Chenghai Huang @ 2025-11-12  1:58 UTC (permalink / raw)
  To: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api
  Cc: fanghao11, shenyang39, liulongfang, qianweili

These patches introduce 128-bit IO access functionality. The reason
is that the current HiSilicon cryptographic devices need to
maintain atomic operations when accessing 128-bit MMIO across
physical and virtual functions.

Currently, 128-bit atomic writes have already been implemented in
the device driver, and the driver also depends on a 128-bit atomic
read access interface. Therefore, we have introduced a generic
128-bit IO access interface to replace the implementation of
128-bit read and write IO interfaces using instructions in the
device driver. When the architecture does not support 128-bit
atomic operations, non-atomic 128-bit read and write interfaces can
be used to make the driver functional.

Weili Qian (4):
  UAPI: Introduce 128-bit types and byteswap operations
  asm-generic/io.h: add io{read,write}128 accessors
  io-128-nonatomic: introduce io{read|write}128_{lo_hi|hi_lo}
  arm64/io: Add {__raw_read|__raw_write}128 support

 arch/arm64/include/asm/io.h                  | 21 +++++++++
 include/asm-generic/io.h                     | 48 ++++++++++++++++++++
 include/linux/io-128-nonatomic-hi-lo.h       | 35 ++++++++++++++
 include/linux/io-128-nonatomic-lo-hi.h       | 34 ++++++++++++++
 include/uapi/linux/byteorder/big_endian.h    |  6 +++
 include/uapi/linux/byteorder/little_endian.h |  6 +++
 include/uapi/linux/swab.h                    | 10 ++++
 include/uapi/linux/types.h                   |  3 ++
 8 files changed, 163 insertions(+)
 create mode 100644 include/linux/io-128-nonatomic-hi-lo.h
 create mode 100644 include/linux/io-128-nonatomic-lo-hi.h

-- 
2.33.0

^ permalink raw reply

* [PATCH RFC 3/4] io-128-nonatomic: introduce io{read|write}128_{lo_hi|hi_lo}
From: Chenghai Huang @ 2025-11-12  1:58 UTC (permalink / raw)
  To: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api
  Cc: fanghao11, shenyang39, liulongfang, qianweili
In-Reply-To: <20251112015846.1842207-1-huangchenghai2@huawei.com>

From: Weili Qian <qianweili@huawei.com>

In order to provide non-atomic functions for io{read|write}128.
We define a number of variants of these functions in the generic
iomap that will do non-atomic operations.

These functions are only defined if io{read|write}128 are defined.
If they are not, then the wrappers that always use non-atomic operations
from include/linux/io-128-nonatomic*.h will be used.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
---
 include/linux/io-128-nonatomic-hi-lo.h | 35 ++++++++++++++++++++++++++
 include/linux/io-128-nonatomic-lo-hi.h | 34 +++++++++++++++++++++++++
 2 files changed, 69 insertions(+)
 create mode 100644 include/linux/io-128-nonatomic-hi-lo.h
 create mode 100644 include/linux/io-128-nonatomic-lo-hi.h

diff --git a/include/linux/io-128-nonatomic-hi-lo.h b/include/linux/io-128-nonatomic-hi-lo.h
new file mode 100644
index 000000000000..b5b083a9e81b
--- /dev/null
+++ b/include/linux/io-128-nonatomic-hi-lo.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IO_128_NONATOMIC_HI_LO_H_
+#define _LINUX_IO_128_NONATOMIC_HI_LO_H_
+
+#include <linux/io.h>
+#include <asm-generic/int-ll64.h>
+
+static inline u128 ioread128_hi_lo(const void __iomem *addr)
+{
+	u32 low, high;
+
+	high = ioread64(addr + sizeof(u64));
+	low = ioread64(addr);
+
+	return low + ((u128)high << 64);
+}
+
+static inline void iowrite128_hi_lo(u128 val, void __iomem *addr)
+{
+	iowrite64(val >> 64, addr + sizeof(u64));
+	iowrite64(val, addr);
+}
+
+#ifndef ioread128
+#define ioread128_is_nonatomic
+#define ioread128 ioread128_hi_lo
+#endif
+
+#ifndef iowrite128
+#define iowrite128_is_nonatomic
+#define iowrite128 iowrite128_hi_lo
+#endif
+
+#endif	/* _LINUX_IO_128_NONATOMIC_HI_LO_H_ */
+
diff --git a/include/linux/io-128-nonatomic-lo-hi.h b/include/linux/io-128-nonatomic-lo-hi.h
new file mode 100644
index 000000000000..0448ee5a13de
--- /dev/null
+++ b/include/linux/io-128-nonatomic-lo-hi.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IO_128_NONATOMIC_LO_HI_H_
+#define _LINUX_IO_128_NONATOMIC_LO_HI_H_
+
+#include <linux/io.h>
+#include <asm-generic/int-ll64.h>
+
+static inline u128 ioread128_lo_hi(const void __iomem *addr)
+{
+	u64 low, high;
+
+	low = ioread64(addr);
+	high = ioread64(addr + sizeof(u64));
+
+	return low + ((u128)high << 64);
+}
+
+static inline void iowrite128_lo_hi(u128 val, void __iomem *addr)
+{
+	iowrite64(val, addr);
+	iowrite64(val >> 64, addr + sizeof(u64));
+}
+
+#ifndef ioread128
+#define ioread128_is_nonatomic
+#define ioread128 ioread128_lo_hi
+#endif
+
+#ifndef iowrite128
+#define iowrite128_is_nonatomic
+#define iowrite128 iowrite128_lo_hi
+#endif
+
+#endif	/* _LINUX_IO_128_NONATOMIC_LO_HI_H_ */
-- 
2.33.0


^ permalink raw reply related

* [PATCH RFC 4/4] arm64/io: Add {__raw_read|__raw_write}128 support
From: Chenghai Huang @ 2025-11-12  1:58 UTC (permalink / raw)
  To: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api
  Cc: fanghao11, shenyang39, liulongfang, qianweili
In-Reply-To: <20251112015846.1842207-1-huangchenghai2@huawei.com>

From: Weili Qian <qianweili@huawei.com>

Starting from ARMv8.4, stp and ldp instructions become atomic.
Currently, device drivers depend on 128-bit atomic memory IO access,
but these are implemented within the drivers. Therefore, this introduces
generic {__raw_read|__raw_write}128 function for 128-bit memory access.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
---
 arch/arm64/include/asm/io.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 83e03abbb2ca..80430750a28c 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -50,6 +50,17 @@ static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
+#define __raw_write128 __raw_write128
+static __always_inline void __raw_write128(u128 val, volatile void __iomem *addr)
+{
+	u64 low, high;
+
+	low = val;
+	high = (u64)(val >> 64);
+
+	asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(low), "rZ"(high), "r"(addr));
+}
+
 #define __raw_readb __raw_readb
 static __always_inline u8 __raw_readb(const volatile void __iomem *addr)
 {
@@ -95,6 +106,16 @@ static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
 	return val;
 }
 
+#define __raw_read128 __raw_read128
+static __always_inline u128 __raw_read128(const volatile void __iomem *addr)
+{
+	u64 high, low;
+
+	asm volatile("ldp %0, %1, [%2]" : "=r" (low), "=r" (high) : "r" (addr));
+
+	return (((u128)high << 64) | (u128)low);
+}
+
 /* IO barriers */
 #define __io_ar(v)							\
 ({									\
-- 
2.33.0


^ permalink raw reply related

* [PATCH RFC 1/4] UAPI: Introduce 128-bit types and byteswap operations
From: Chenghai Huang @ 2025-11-12  1:58 UTC (permalink / raw)
  To: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api
  Cc: fanghao11, shenyang39, liulongfang, qianweili
In-Reply-To: <20251112015846.1842207-1-huangchenghai2@huawei.com>

From: Weili Qian <qianweili@huawei.com>

Architectures like ARM64 support 128-bit integer types and
operations. This patch adds a generic byte order conversion
interface for 128-bit.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
---
 include/uapi/linux/byteorder/big_endian.h    |  6 ++++++
 include/uapi/linux/byteorder/little_endian.h |  6 ++++++
 include/uapi/linux/swab.h                    | 10 ++++++++++
 include/uapi/linux/types.h                   |  3 +++
 4 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/byteorder/big_endian.h b/include/uapi/linux/byteorder/big_endian.h
index 80aa5c41a763..318d51a18f43 100644
--- a/include/uapi/linux/byteorder/big_endian.h
+++ b/include/uapi/linux/byteorder/big_endian.h
@@ -29,6 +29,12 @@
 #define __constant_be32_to_cpu(x) ((__force __u32)(__be32)(x))
 #define __constant_cpu_to_be16(x) ((__force __be16)(__u16)(x))
 #define __constant_be16_to_cpu(x) ((__force __u16)(__be16)(x))
+
+#ifdef __SIZEOF_INT128__
+#define __cpu_to_le128(x) ((__force __le128)__swab128((x)))
+#define __le128_to_cpu(x) __swab128((__force __u128)(__le128)(x))
+#endif
+
 #define __cpu_to_le64(x) ((__force __le64)__swab64((x)))
 #define __le64_to_cpu(x) __swab64((__force __u64)(__le64)(x))
 #define __cpu_to_le32(x) ((__force __le32)__swab32((x)))
diff --git a/include/uapi/linux/byteorder/little_endian.h b/include/uapi/linux/byteorder/little_endian.h
index cd98982e7523..b2732452b825 100644
--- a/include/uapi/linux/byteorder/little_endian.h
+++ b/include/uapi/linux/byteorder/little_endian.h
@@ -29,6 +29,12 @@
 #define __constant_be32_to_cpu(x) ___constant_swab32((__force __u32)(__be32)(x))
 #define __constant_cpu_to_be16(x) ((__force __be16)___constant_swab16((x)))
 #define __constant_be16_to_cpu(x) ___constant_swab16((__force __u16)(__be16)(x))
+
+#ifdef __SIZEOF_INT128__
+#define __cpu_to_le128(x) ((__force __le128)(__u128)(x))
+#define __le128_to_cpu(x) ((__force __u128)(__le128)(x))
+#endif
+
 #define __cpu_to_le64(x) ((__force __le64)(__u64)(x))
 #define __le64_to_cpu(x) ((__force __u64)(__le64)(x))
 #define __cpu_to_le32(x) ((__force __le32)(__u32)(x))
diff --git a/include/uapi/linux/swab.h b/include/uapi/linux/swab.h
index 01717181339e..7381b9a785ce 100644
--- a/include/uapi/linux/swab.h
+++ b/include/uapi/linux/swab.h
@@ -133,6 +133,16 @@ static inline __attribute_const__ __u32 __fswahb32(__u32 val)
 	__fswab64(x))
 #endif
 
+#ifdef __SIZEOF_INT128__
+static inline __attribute_const__ __u128 __swab128(__u128 val)
+{
+	__u64 h = val >> 64;
+	__u64 l = val;
+
+	return (((__u128)__swab64(l)) << 64) | ((__u128)(__swab64(h)));
+}
+#endif
+
 static __always_inline unsigned long __swab(const unsigned long y)
 {
 #if __BITS_PER_LONG == 64
diff --git a/include/uapi/linux/types.h b/include/uapi/linux/types.h
index 48b933938877..9624ea43cd8a 100644
--- a/include/uapi/linux/types.h
+++ b/include/uapi/linux/types.h
@@ -40,6 +40,9 @@ typedef __u32 __bitwise __be32;
 typedef __u64 __bitwise __le64;
 typedef __u64 __bitwise __be64;
 
+#ifdef __SIZEOF_INT128__
+typedef __u128 __bitwise __le128;
+#endif
 typedef __u16 __bitwise __sum16;
 typedef __u32 __bitwise __wsum;
 
-- 
2.33.0


^ permalink raw reply related

* [PATCH RFC 2/4] asm-generic/io.h: add io{read,write}128 accessors
From: Chenghai Huang @ 2025-11-12  1:58 UTC (permalink / raw)
  To: arnd, catalin.marinas, will, akpm, anshuman.khandual,
	ryan.roberts, andriy.shevchenko, herbert, linux-kernel,
	linux-arch, linux-arm-kernel, linux-crypto, linux-api
  Cc: fanghao11, shenyang39, liulongfang, qianweili
In-Reply-To: <20251112015846.1842207-1-huangchenghai2@huawei.com>

From: Weili Qian <qianweili@huawei.com>

Architectures like ARM64 already support 128-bit memory access. Currently,
device drivers implement atomic read and write operations for 128-bit
memory using assembly. This patch adds generic io{read,write}128 access
functions, which will enable device drivers to consistently use
io{read,write}128 for 128-bit access.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
---
 include/asm-generic/io.h | 48 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index ca5a1ce6f0f8..c419021318e6 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -146,6 +146,16 @@ static inline u64 __raw_readq(const volatile void __iomem *addr)
 #endif
 #endif /* CONFIG_64BIT */
 
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+#ifndef __raw_read128
+#define __raw_read128 __raw_read128
+static inline u128 __raw_read128(volatile void __iomem *addr)
+{
+	return *(const volatile u128 __force *)addr;
+}
+#endif
+#endif /* CONFIG_ARCH_SUPPORTS_INT128 */
+
 #ifndef __raw_writeb
 #define __raw_writeb __raw_writeb
 static inline void __raw_writeb(u8 value, volatile void __iomem *addr)
@@ -180,6 +190,16 @@ static inline void __raw_writeq(u64 value, volatile void __iomem *addr)
 #endif
 #endif /* CONFIG_64BIT */
 
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+#ifndef __raw_write128
+#define __raw_write128 __raw_write128
+static inline void __raw_write128(u128 value, volatile void __iomem *addr)
+{
+	*(volatile u128 __force *)addr = value;
+}
+#endif
+#endif /* CONFIG_ARCH_SUPPORTS_INT128 */
+
 /*
  * {read,write}{b,w,l,q}() access little endian memory and return result in
  * native endianness.
@@ -917,6 +937,22 @@ static inline u64 ioread64(const volatile void __iomem *addr)
 #endif
 #endif /* CONFIG_64BIT */
 
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+#ifndef ioread128
+#define ioread128 ioread128
+static inline u128 ioread128(const volatile void __iomem *addr)
+{
+	u128 val;
+
+	__io_br();
+	val = __le128_to_cpu((__le128 __force)__raw_read128(addr));
+	__io_ar(val);
+
+	return val;
+}
+#endif
+#endif /* CONFIG_ARCH_SUPPORTS_INT128 */
+
 #ifndef iowrite8
 #define iowrite8 iowrite8
 static inline void iowrite8(u8 value, volatile void __iomem *addr)
@@ -951,6 +987,18 @@ static inline void iowrite64(u64 value, volatile void __iomem *addr)
 #endif
 #endif /* CONFIG_64BIT */
 
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+#ifndef iowrite128
+#define iowrite128 iowrite128
+static inline void iowrite128(u128 value, volatile void __iomem *addr)
+{
+	__io_bw();
+	__raw_write128((u128 __force)__cpu_to_le128(value), addr);
+	__io_aw();
+}
+#endif
+#endif /* CONFIG_ARCH_SUPPORTS_INT128 */
+
 #ifndef ioread16be
 #define ioread16be ioread16be
 static inline u16 ioread16be(const volatile void __iomem *addr)
-- 
2.33.0


^ permalink raw reply related

* Re: RFC: Serial port DTR/RTS - O_NRESETDEV
From: H. Peter Anvin @ 2025-11-11 21:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Maarten Brock, linux-serial@vger.kernel.org,
	linux-api@vger.kernel.org, LKML
In-Reply-To: <20251111043803.GK2988753@mit.edu>

On 2025-11-10 20:38, Theodore Ts'o wrote:
> On Mon, Nov 10, 2025 at 07:57:22PM -0800, H. Peter Anvin wrote:
>> I really think you are looking at this from a very odd point of
>> view, and you seem to be very inconsistent. Boot time setup? Isn't
>> that what setserial is for? We have the ability to feed this
>> configuration already, but you need a file descriptor.
> 
> I'm not really fond of adding some new open flag that to me seems
> **very** serial / RS-485 specific, and so I'm trying to find some
> way to avoid it.
> 

I don't think it is.  "Opening this device for configuration."

> I also think that that the GPIO style timing requirements of RTS
> **really** should be done as a line discpline, and not in userspace.
> 

No disagreement there -- and so it is. What I want to do is a way to *attach*
that line discipline without poking with the serial port itself.  That's what
I keep trying to get at.

>> Honestly, though, I'm far less interested in what 8250-based hardware does than e.g. USB.
> 
> I'm quite confident that USB won't have "state" that will be preserved
> across a reboot, because the device won't even get powered up until
> the USB device is attached.  And part of the problem was that the
> requirements weren't particularly clear, and given the insistence that
> the "state" be preserved even across reboot, despite the serial port
> autoconfiguration, I had assumed you were posting uing the COM 1/2/3/4
> ports where autoconfiguration isn't stricty speaking necessary.
> 
> In some ways, USB ports might be easier, since it should be possible
> to specify udev rules which get passed to the driver when the USB
> serial device is inserted, and so *that* can easily be done without
> needing a file descriptor.
> 
> And for this sort of thing, it seems perfectly fair to hard code some
> specific behavior using either a boot command line or a udev rule,
> since you seem to be positing that the serial port will be dedicated
> to some kind of weird-shit RS-485 bus device, where any time RTS/DTR
> gets raised, the bus will malfunction in weird and wondrous ways....

But again, it is very much a configuration property.  You don't know where
your dynamically assigned serial port will end up -- and you *can't*, because
it is a property of the DCE -- what is plugged *into* the device.

Now you have someone writing a terminal program or something like Arduino and
decide to enumerate serial ports (which, as I stated, you can't actually do
right now without opening the devices).  This is why it makes sense for the
open() caller to declare intent; this is similar to how O_NDELAY replaced
callout devices.

It would be lovely if we could do something like
open("/dev/ttyS0/option-string") and so on, but that is well and truly a far
bigger change to the whole driver API.

	-hpa

^ permalink raw reply

* Re: [PATCH v5 05/22] liveupdate: kho: when live update add KHO image during kexec load
From: Pasha Tatashin @ 2025-11-11 20:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aROaJUjyyZqJ19Wo@kernel.org>

> I believe that when my concerns about "[PATCH v5 02/22] liveupdate:
> luo_core: integrate with KHO" [1] are resolved this patch won't be needed.
>
> [1] https://lore.kernel.org/all/aROZi043lxtegqWE@kernel.org/

Thank you, I replied to your comments in that patch. However, until
KHO becomes statless this change is needed. We *must* have KHO image
as part of kexec load if liveupdate=1.

>
> > Pasha
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-11-11 20:57 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aROZi043lxtegqWE@kernel.org>

Hi Mike,

Thank you for review, my comments below:

> > This is why this call is placed first in reboot(), before any
> > irreversible reboot notifiers or shutdown callbacks are performed. If
> > an allocation problem occurs in KHO, the error is simply reported back
> > to userspace, and the live update update is safely aborted.
>
> This is fine. But what I don't like is that we can't use kho without
> liveupdate. We are making debugfs optional, we have a way to call

Yes you can: you can disable liveupdate (i.e. not supply liveupdate=1
via kernel parameter) and use KHO the old way: drive it from the
userspace. However, if liveupdate is enabled, liveupdate becomes the
driver of KHO as unfortunately KHO has these weird states at the
moment.

> kho_finalize() on the reboot path and it does not seem an issue to do it
> even without liveupdate. But then we force kho_finalize() into
> liveupdate_reboot() allowing weird configurations where kho is there but
> it's unusable.

What do you mean KHO is there but unusable, we should not have such a state...

> What I'd like to see is that we can finalize KHO on kexec reboot path even
> when liveupdate is not compiled and until then the patch that makes KHO
> debugfs optional should not go further IMO.
>
> Another thing I didn't check in this series yet is how finalization driven
> from debugfs interacts with liveupdate internal handling?

I think what we can do is the following:
- Remove "Kconfig: make debugfs optional" from this series, and
instead make that change as part of stateless KHO work.
- This will ensure that when liveupdate=0 always KHO finalize is fully
support the old way.
- When liveupdate=1 always disable KHO debugfs "finalize" API, and
allow liveupdate to drive it automatically. It would add another
liveupdate_enable() check to KHO, and is going to be removed as part
of stateless KHO work.

Pasha

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-11-11 20:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <CA+CK2bDnaLJS9GdO_7Anhwah2uQrYYk_RhQMSiRL-YB=8ZZZWQ@mail.gmail.com>

On Tue, Nov 11, 2025 at 3:39 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > >       kho_memory_init();
> > >
> > > +     /* Live Update should follow right after KHO is initialized */
> > > +     liveupdate_init();
> > > +
> >
> > Why do you think it should be immediately after kho_memory_init()?
> > Any reason this can't be called from start_kernel() or even later as an
> > early_initcall() or core_initall()?
>
> Unfortunately, no, even here it is too late, and we might need to find
> a way to move the kho_init/liveupdate_init earlier. We must be able to
> preserve HugeTLB pages, and those are reserved earlier in boot.

Just to clarify: liveupdate_init() is needed to start using:
liveupdate_flb_incoming_* API, and FLB data is needed during HugeTLB
reservation.

Pasha

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-11-11 20:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aRObz4bQzRHH5hJb@kernel.org>

> >       kho_memory_init();
> >
> > +     /* Live Update should follow right after KHO is initialized */
> > +     liveupdate_init();
> > +
>
> Why do you think it should be immediately after kho_memory_init()?
> Any reason this can't be called from start_kernel() or even later as an
> early_initcall() or core_initall()?

Unfortunately, no, even here it is too late, and we might need to find
a way to move the kho_init/liveupdate_init earlier. We must be able to
preserve HugeTLB pages, and those are reserved earlier in boot.

Pasha

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Mike Rapoport @ 2025-11-11 20:25 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <20251107210526.257742-3-pasha.tatashin@soleen.com>

On Fri, Nov 07, 2025 at 04:03:00PM -0500, Pasha Tatashin wrote:
> Integrate the LUO with the KHO framework to enable passing LUO state
> across a kexec reboot.
> 
> When LUO is transitioned to a "prepared" state, it tells KHO to
> finalize, so all memory segments that were added to KHO preservation
> list are getting preserved. After "Prepared" state no new segments
> can be preserved. If LUO is canceled, it also tells KHO to cancel the
> serialization, and therefore, later LUO can go back into the prepared
> state.
> 
> This patch introduces the following changes:
> - During the KHO finalization phase allocate FDT blob.
> - Populate this FDT with a LUO compatibility string ("luo-v1").
> 
> LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
> logic (`luo_do_*_calls`) remains unimplemented in this patch.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

...

> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index c6812b4dbb2e..20c850a52167 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -21,6 +21,7 @@
>  #include <linux/buffer_head.h>
>  #include <linux/kmemleak.h>
>  #include <linux/kfence.h>
> +#include <linux/liveupdate.h>
>  #include <linux/page_ext.h>
>  #include <linux/pti.h>
>  #include <linux/pgtable.h>
> @@ -2703,6 +2704,9 @@ void __init mm_core_init(void)
>  	 */
>  	kho_memory_init();
>  
> +	/* Live Update should follow right after KHO is initialized */
> +	liveupdate_init();
> +

Why do you think it should be immediately after kho_memory_init()?
Any reason this can't be called from start_kernel() or even later as an
early_initcall() or core_initall()?

>  	memblock_free_all();
>  	mem_init();
>  	kmem_cache_init();
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v5 05/22] liveupdate: kho: when live update add KHO image during kexec load
From: Mike Rapoport @ 2025-11-11 20:18 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <CA+CK2bA=cQkibx4dSxJQTVxVxqkAsZPfFoPJip6rx8DqX62aEA@mail.gmail.com>

On Mon, Nov 10, 2025 at 10:31:23AM -0500, Pasha Tatashin wrote:
> On Mon, Nov 10, 2025 at 7:47 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Fri, Nov 07, 2025 at 04:03:03PM -0500, Pasha Tatashin wrote:
> > > In case KHO is driven from within kernel via live update, finalize will
> > > always happen during reboot, so add the KHO image unconditionally.
> > >
> > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > ---
> > >  kernel/liveupdate/kexec_handover.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > > index 9f0913e101be..b54ca665e005 100644
> > > --- a/kernel/liveupdate/kexec_handover.c
> > > +++ b/kernel/liveupdate/kexec_handover.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/kexec_handover.h>
> > >  #include <linux/libfdt.h>
> > >  #include <linux/list.h>
> > > +#include <linux/liveupdate.h>
> > >  #include <linux/memblock.h>
> > >  #include <linux/page-isolation.h>
> > >  #include <linux/vmalloc.h>
> > > @@ -1489,7 +1490,7 @@ int kho_fill_kimage(struct kimage *image)
> > >       int err = 0;
> > >       struct kexec_buf scratch;
> > >
> > > -     if (!kho_out.finalized)
> > > +     if (!kho_out.finalized && !liveupdate_enabled())
> > >               return 0;
> >
> > This feels backwards, I don't think KHO should call liveupdate methods.
> 
> It is backward, but it is a requirement until KHO becomes stateless.
> LUO does not have dependencies on userspace state of when kexec is
> loaded. In fact the next kernel must be loaded before the brownout as
> it is an expensive operation. The sequence of events should:
> 
> 1. Load the next kernel in memory
> 2. Preserve resources via LUO
> 3. Do Kexec reboot

I believe that when my concerns about "[PATCH v5 02/22] liveupdate:
luo_core: integrate with KHO" [1] are resolved this patch won't be needed.

[1] https://lore.kernel.org/all/aROZi043lxtegqWE@kernel.org/
 
> Pasha

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v5 02/22] liveupdate: luo_core: integrate with KHO
From: Mike Rapoport @ 2025-11-11 20:16 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <CA+CK2bCHhbBtSJCx38gxjfR6DM1PjcfsOTD-Pqzqyez1_hXJ7Q@mail.gmail.com>

On Mon, Nov 10, 2025 at 10:43:43AM -0500, Pasha Tatashin wrote:
> >
> > kho_finalize() should be really called from kernel_kexec().
> >
> > We avoided it because of the concern that memory allocations that late in
> > reboot could be an issue. But I looked at hibernate() and it does
> > allocations on reboot->hibernate path, so adding kho_finalize() as the
> > first step of kernel_kexec() seems fine.
> 
> This isn't a regular reboot; it's a live update. The
> liveupdate_reboot() is designed to be reversible and allows us to
> return an error, undoing the freeze() operations via unfreeze() in
> case of failure.
> 
> This is why this call is placed first in reboot(), before any
> irreversible reboot notifiers or shutdown callbacks are performed. If
> an allocation problem occurs in KHO, the error is simply reported back
> to userspace, and the live update update is safely aborted.

This is fine. But what I don't like is that we can't use kho without
liveupdate. We are making debugfs optional, we have a way to call
kho_finalize() on the reboot path and it does not seem an issue to do it
even without liveupdate. But then we force kho_finalize() into
liveupdate_reboot() allowing weird configurations where kho is there but
it's unusable.

What I'd like to see is that we can finalize KHO on kexec reboot path even
when liveupdate is not compiled and until then the patch that makes KHO
debugfs optional should not go further IMO.

Another thing I didn't check in this series yet is how finalization driven
from debugfs interacts with liveupdate internal handling?
 
> > And if we prioritize stateless memory tracking in KHO, it won't be a
> > concern at all.
> 
> We are prioritizing stateless KHO work ;-) +Jason Miu
> Once KHO is stateless, the kho_finalize() is going to be removed.

There's still fdt_finish(), but it can't fail in practice. 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v6 17/17] vfs: expose delegation support to userland
From: Jan Kara @ 2025-11-11 14:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, linux-fsdevel, linux-kernel, linux-nfs,
	linux-cifs, samba-technical, netfs, ecryptfs, linux-unionfs,
	linux-xfs, netdev, linux-api
In-Reply-To: <20251111-dir-deleg-ro-v6-17-52f3feebb2f2@kernel.org>

On Tue 11-11-25 09:12:58, Jeff Layton wrote:
> Now that support for recallable directory delegations is available,
> expose this functionality to userland with new F_SETDELEG and F_GETDELEG
> commands for fcntl().
> 
> Note that this also allows userland to request a FL_DELEG type lease on
> files too. Userland applications that do will get signalled when there
> are metadata changes in addition to just data changes (which is a
> limitation of FL_LEASE leases).
> 
> These commands accept a new "struct delegation" argument that contains a
> flags field for future expansion.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fcntl.c                 | 13 +++++++++++++
>  fs/locks.c                 | 45 ++++++++++++++++++++++++++++++++++++++++-----
>  include/linux/filelock.h   | 12 ++++++++++++
>  include/uapi/linux/fcntl.h | 11 +++++++++++
>  4 files changed, 76 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 72f8433d9109889eecef56b32d20a85b4e12ea44..f93dbca0843557d197bd1e023519cfa0f00ad78f 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -445,6 +445,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  		struct file *filp)
>  {
>  	void __user *argp = (void __user *)arg;
> +	struct delegation deleg;
>  	int argi = (int)arg;
>  	struct flock flock;
>  	long err = -EINVAL;
> @@ -550,6 +551,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	case F_SET_RW_HINT:
>  		err = fcntl_set_rw_hint(filp, arg);
>  		break;
> +	case F_GETDELEG:
> +		if (copy_from_user(&deleg, argp, sizeof(deleg)))
> +			return -EFAULT;
> +		err = fcntl_getdeleg(filp, &deleg);
> +		if (!err && copy_to_user(argp, &deleg, sizeof(deleg)))
> +			return -EFAULT;
> +		break;
> +	case F_SETDELEG:
> +		if (copy_from_user(&deleg, argp, sizeof(deleg)))
> +			return -EFAULT;
> +		err = fcntl_setdeleg(fd, filp, &deleg);
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/fs/locks.c b/fs/locks.c
> index dd290a87f58eb5d522f03fa99d612fbad84dacf3..7f4ccc7974bc8d3e82500ee692c6520b53f2280f 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1703,7 +1703,7 @@ EXPORT_SYMBOL(lease_get_mtime);
>   *	XXX: sfr & willy disagree over whether F_INPROGRESS
>   *	should be returned to userspace.
>   */
> -int fcntl_getlease(struct file *filp)
> +static int __fcntl_getlease(struct file *filp, unsigned int flavor)
>  {
>  	struct file_lease *fl;
>  	struct inode *inode = file_inode(filp);
> @@ -1719,7 +1719,8 @@ int fcntl_getlease(struct file *filp)
>  		list_for_each_entry(fl, &ctx->flc_lease, c.flc_list) {
>  			if (fl->c.flc_file != filp)
>  				continue;
> -			type = target_leasetype(fl);
> +			if (fl->c.flc_flags & flavor)
> +				type = target_leasetype(fl);
>  			break;
>  		}
>  		spin_unlock(&ctx->flc_lock);
> @@ -1730,6 +1731,19 @@ int fcntl_getlease(struct file *filp)
>  	return type;
>  }
>  
> +int fcntl_getlease(struct file *filp)
> +{
> +	return __fcntl_getlease(filp, FL_LEASE);
> +}
> +
> +int fcntl_getdeleg(struct file *filp, struct delegation *deleg)
> +{
> +	if (deleg->d_flags != 0 || deleg->__pad != 0)
> +		return -EINVAL;
> +	deleg->d_type = __fcntl_getlease(filp, FL_DELEG);
> +	return 0;
> +}
> +
>  /**
>   * check_conflicting_open - see if the given file points to an inode that has
>   *			    an existing open that would conflict with the
> @@ -2039,13 +2053,13 @@ vfs_setlease(struct file *filp, int arg, struct file_lease **lease, void **priv)
>  }
>  EXPORT_SYMBOL_GPL(vfs_setlease);
>  
> -static int do_fcntl_add_lease(unsigned int fd, struct file *filp, int arg)
> +static int do_fcntl_add_lease(unsigned int fd, struct file *filp, unsigned int flavor, int arg)
>  {
>  	struct file_lease *fl;
>  	struct fasync_struct *new;
>  	int error;
>  
> -	fl = lease_alloc(filp, FL_LEASE, arg);
> +	fl = lease_alloc(filp, flavor, arg);
>  	if (IS_ERR(fl))
>  		return PTR_ERR(fl);
>  
> @@ -2081,7 +2095,28 @@ int fcntl_setlease(unsigned int fd, struct file *filp, int arg)
>  
>  	if (arg == F_UNLCK)
>  		return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
> -	return do_fcntl_add_lease(fd, filp, arg);
> +	return do_fcntl_add_lease(fd, filp, FL_LEASE, arg);
> +}
> +
> +/**
> + *	fcntl_setdeleg	-	sets a delegation on an open file
> + *	@fd: open file descriptor
> + *	@filp: file pointer
> + *	@deleg: delegation request from userland
> + *
> + *	Call this fcntl to establish a delegation on the file.
> + *	Note that you also need to call %F_SETSIG to
> + *	receive a signal when the lease is broken.
> + */
> +int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg)
> +{
> +	/* For now, no flags are supported */
> +	if (deleg->d_flags != 0 || deleg->__pad != 0)
> +		return -EINVAL;
> +
> +	if (deleg->d_type == F_UNLCK)
> +		return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
> +	return do_fcntl_add_lease(fd, filp, FL_DELEG, deleg->d_type);
>  }
>  
>  /**
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 208d108df2d73a9df65e5dc9968d074af385f881..54b824c05299261e6bd6acc4175cb277ea35b35d 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -159,6 +159,8 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
>  
>  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
>  int fcntl_getlease(struct file *filp);
> +int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg);
> +int fcntl_getdeleg(struct file *filp, struct delegation *deleg);
>  
>  static inline bool lock_is_unlock(struct file_lock *fl)
>  {
> @@ -278,6 +280,16 @@ static inline int fcntl_getlease(struct file *filp)
>  	return F_UNLCK;
>  }
>  
> +static inline int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int fcntl_getdeleg(struct file *filp, struct delegation *deleg)
> +{
> +	return -EINVAL;
> +}
> +
>  static inline bool lock_is_unlock(struct file_lock *fl)
>  {
>  	return false;
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 3741ea1b73d8500061567b6590ccf5fb4c6770f0..008fac15e573084a9b48e4e991528b4363c54047 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -79,6 +79,17 @@
>   */
>  #define RWF_WRITE_LIFE_NOT_SET	RWH_WRITE_LIFE_NOT_SET
>  
> +/* Set/Get delegations */
> +#define F_GETDELEG		(F_LINUX_SPECIFIC_BASE + 15)
> +#define F_SETDELEG		(F_LINUX_SPECIFIC_BASE + 16)
> +
> +/* Argument structure for F_GETDELEG and F_SETDELEG */
> +struct delegation {
> +	uint32_t	d_flags;	/* Must be 0 */
> +	uint16_t	d_type;		/* F_RDLCK, F_WRLCK, F_UNLCK */
> +	uint16_t	__pad;		/* Must be 0 */
> +};
> +
>  /*
>   * Types of directory notifications that may be requested.
>   */
> 
> -- 
> 2.51.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v6 17/17] vfs: expose delegation support to userland
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

Now that support for recallable directory delegations is available,
expose this functionality to userland with new F_SETDELEG and F_GETDELEG
commands for fcntl().

Note that this also allows userland to request a FL_DELEG type lease on
files too. Userland applications that do will get signalled when there
are metadata changes in addition to just data changes (which is a
limitation of FL_LEASE leases).

These commands accept a new "struct delegation" argument that contains a
flags field for future expansion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/fcntl.c                 | 13 +++++++++++++
 fs/locks.c                 | 45 ++++++++++++++++++++++++++++++++++++++++-----
 include/linux/filelock.h   | 12 ++++++++++++
 include/uapi/linux/fcntl.h | 11 +++++++++++
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 72f8433d9109889eecef56b32d20a85b4e12ea44..f93dbca0843557d197bd1e023519cfa0f00ad78f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -445,6 +445,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
 	void __user *argp = (void __user *)arg;
+	struct delegation deleg;
 	int argi = (int)arg;
 	struct flock flock;
 	long err = -EINVAL;
@@ -550,6 +551,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_RW_HINT:
 		err = fcntl_set_rw_hint(filp, arg);
 		break;
+	case F_GETDELEG:
+		if (copy_from_user(&deleg, argp, sizeof(deleg)))
+			return -EFAULT;
+		err = fcntl_getdeleg(filp, &deleg);
+		if (!err && copy_to_user(argp, &deleg, sizeof(deleg)))
+			return -EFAULT;
+		break;
+	case F_SETDELEG:
+		if (copy_from_user(&deleg, argp, sizeof(deleg)))
+			return -EFAULT;
+		err = fcntl_setdeleg(fd, filp, &deleg);
+		break;
 	default:
 		break;
 	}
diff --git a/fs/locks.c b/fs/locks.c
index dd290a87f58eb5d522f03fa99d612fbad84dacf3..7f4ccc7974bc8d3e82500ee692c6520b53f2280f 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1703,7 +1703,7 @@ EXPORT_SYMBOL(lease_get_mtime);
  *	XXX: sfr & willy disagree over whether F_INPROGRESS
  *	should be returned to userspace.
  */
-int fcntl_getlease(struct file *filp)
+static int __fcntl_getlease(struct file *filp, unsigned int flavor)
 {
 	struct file_lease *fl;
 	struct inode *inode = file_inode(filp);
@@ -1719,7 +1719,8 @@ int fcntl_getlease(struct file *filp)
 		list_for_each_entry(fl, &ctx->flc_lease, c.flc_list) {
 			if (fl->c.flc_file != filp)
 				continue;
-			type = target_leasetype(fl);
+			if (fl->c.flc_flags & flavor)
+				type = target_leasetype(fl);
 			break;
 		}
 		spin_unlock(&ctx->flc_lock);
@@ -1730,6 +1731,19 @@ int fcntl_getlease(struct file *filp)
 	return type;
 }
 
+int fcntl_getlease(struct file *filp)
+{
+	return __fcntl_getlease(filp, FL_LEASE);
+}
+
+int fcntl_getdeleg(struct file *filp, struct delegation *deleg)
+{
+	if (deleg->d_flags != 0 || deleg->__pad != 0)
+		return -EINVAL;
+	deleg->d_type = __fcntl_getlease(filp, FL_DELEG);
+	return 0;
+}
+
 /**
  * check_conflicting_open - see if the given file points to an inode that has
  *			    an existing open that would conflict with the
@@ -2039,13 +2053,13 @@ vfs_setlease(struct file *filp, int arg, struct file_lease **lease, void **priv)
 }
 EXPORT_SYMBOL_GPL(vfs_setlease);
 
-static int do_fcntl_add_lease(unsigned int fd, struct file *filp, int arg)
+static int do_fcntl_add_lease(unsigned int fd, struct file *filp, unsigned int flavor, int arg)
 {
 	struct file_lease *fl;
 	struct fasync_struct *new;
 	int error;
 
-	fl = lease_alloc(filp, FL_LEASE, arg);
+	fl = lease_alloc(filp, flavor, arg);
 	if (IS_ERR(fl))
 		return PTR_ERR(fl);
 
@@ -2081,7 +2095,28 @@ int fcntl_setlease(unsigned int fd, struct file *filp, int arg)
 
 	if (arg == F_UNLCK)
 		return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
-	return do_fcntl_add_lease(fd, filp, arg);
+	return do_fcntl_add_lease(fd, filp, FL_LEASE, arg);
+}
+
+/**
+ *	fcntl_setdeleg	-	sets a delegation on an open file
+ *	@fd: open file descriptor
+ *	@filp: file pointer
+ *	@deleg: delegation request from userland
+ *
+ *	Call this fcntl to establish a delegation on the file.
+ *	Note that you also need to call %F_SETSIG to
+ *	receive a signal when the lease is broken.
+ */
+int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg)
+{
+	/* For now, no flags are supported */
+	if (deleg->d_flags != 0 || deleg->__pad != 0)
+		return -EINVAL;
+
+	if (deleg->d_type == F_UNLCK)
+		return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
+	return do_fcntl_add_lease(fd, filp, FL_DELEG, deleg->d_type);
 }
 
 /**
diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index 208d108df2d73a9df65e5dc9968d074af385f881..54b824c05299261e6bd6acc4175cb277ea35b35d 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -159,6 +159,8 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
 
 int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
 int fcntl_getlease(struct file *filp);
+int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg);
+int fcntl_getdeleg(struct file *filp, struct delegation *deleg);
 
 static inline bool lock_is_unlock(struct file_lock *fl)
 {
@@ -278,6 +280,16 @@ static inline int fcntl_getlease(struct file *filp)
 	return F_UNLCK;
 }
 
+static inline int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg)
+{
+	return -EINVAL;
+}
+
+static inline int fcntl_getdeleg(struct file *filp, struct delegation *deleg)
+{
+	return -EINVAL;
+}
+
 static inline bool lock_is_unlock(struct file_lock *fl)
 {
 	return false;
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 3741ea1b73d8500061567b6590ccf5fb4c6770f0..008fac15e573084a9b48e4e991528b4363c54047 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -79,6 +79,17 @@
  */
 #define RWF_WRITE_LIFE_NOT_SET	RWH_WRITE_LIFE_NOT_SET
 
+/* Set/Get delegations */
+#define F_GETDELEG		(F_LINUX_SPECIFIC_BASE + 15)
+#define F_SETDELEG		(F_LINUX_SPECIFIC_BASE + 16)
+
+/* Argument structure for F_GETDELEG and F_SETDELEG */
+struct delegation {
+	uint32_t	d_flags;	/* Must be 0 */
+	uint16_t	d_type;		/* F_RDLCK, F_WRLCK, F_UNLCK */
+	uint16_t	__pad;		/* Must be 0 */
+};
+
 /*
  * Types of directory notifications that may be requested.
  */

-- 
2.51.1


^ permalink raw reply related

* [PATCH v6 16/17] nfsd: wire up GET_DIR_DELEGATION handling
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

Add a new routine for acquiring a read delegation on a directory. These
are recallable-only delegations with no support for CB_NOTIFY. That will
be added in a later phase.

Since the same CB_RECALL/DELEGRETURN infrastructure is used for regular
and directory delegations, a normal nfs4_delegation is used to represent
a directory delegation.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4proc.c  |  22 +++++++++++-
 fs/nfsd/nfs4state.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/state.h     |   5 +++
 3 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index e466cf52d7d7e1a78c3a469613a85ab3546d6d17..517968dddf4a33651313658d300f9f929f83c5af 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2341,6 +2341,13 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
 			 union nfsd4_op_u *u)
 {
 	struct nfsd4_get_dir_delegation *gdd = &u->get_dir_delegation;
+	struct nfs4_delegation *dd;
+	struct nfsd_file *nf;
+	__be32 status;
+
+	status = nfsd_file_acquire_dir(rqstp, &cstate->current_fh, &nf);
+	if (status != nfs_ok)
+		return status;
 
 	/*
 	 * RFC 8881, section 18.39.3 says:
@@ -2354,7 +2361,20 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
 	 * return NFS4_OK with a non-fatal status of GDD4_UNAVAIL in this
 	 * situation.
 	 */
-	gdd->gddrnf_status = GDD4_UNAVAIL;
+	dd = nfsd_get_dir_deleg(cstate, gdd, nf);
+	nfsd_file_put(nf);
+	if (IS_ERR(dd)) {
+		int err = PTR_ERR(dd);
+
+		if (err != -EAGAIN)
+			return nfserrno(err);
+		gdd->gddrnf_status = GDD4_UNAVAIL;
+		return nfs_ok;
+	}
+
+	gdd->gddrnf_status = GDD4_OK;
+	memcpy(&gdd->gddr_stateid, &dd->dl_stid.sc_stateid, sizeof(gdd->gddr_stateid));
+	nfs4_put_stid(&dd->dl_stid);
 	return nfs_ok;
 }
 
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index da66798023aba4c36c38208cec7333db237e46e0..8f8c9385101e15b64883eabec71775f26b14f890 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9347,3 +9347,103 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct dentry *dentry,
 	nfs4_put_stid(&dp->dl_stid);
 	return status;
 }
+
+/**
+ * nfsd_get_dir_deleg - attempt to get a directory delegation
+ * @cstate: compound state
+ * @gdd: GET_DIR_DELEGATION arg/resp structure
+ * @nf: nfsd_file opened on the directory
+ *
+ * Given a GET_DIR_DELEGATION request @gdd, attempt to acquire a delegation
+ * on the directory to which @nf refers. Note that this does not set up any
+ * sort of async notifications for the delegation.
+ */
+struct nfs4_delegation *
+nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
+		   struct nfsd4_get_dir_delegation *gdd,
+		   struct nfsd_file *nf)
+{
+	struct nfs4_client *clp = cstate->clp;
+	struct nfs4_delegation *dp;
+	struct file_lease *fl;
+	struct nfs4_file *fp, *rfp;
+	int status = 0;
+
+	fp = nfsd4_alloc_file();
+	if (!fp)
+		return ERR_PTR(-ENOMEM);
+
+	nfsd4_file_init(&cstate->current_fh, fp);
+
+	rfp = nfsd4_file_hash_insert(fp, &cstate->current_fh);
+	if (unlikely(!rfp)) {
+		put_nfs4_file(fp);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (rfp != fp) {
+		put_nfs4_file(fp);
+		fp = rfp;
+	}
+
+	/* if this client already has one, return that it's unavailable */
+	spin_lock(&state_lock);
+	spin_lock(&fp->fi_lock);
+	/* existing delegation? */
+	if (nfs4_delegation_exists(clp, fp)) {
+		status = -EAGAIN;
+	} else if (!fp->fi_deleg_file) {
+		fp->fi_deleg_file = nfsd_file_get(nf);
+		fp->fi_delegees = 1;
+	} else {
+		++fp->fi_delegees;
+	}
+	spin_unlock(&fp->fi_lock);
+	spin_unlock(&state_lock);
+
+	if (status) {
+		put_nfs4_file(fp);
+		return ERR_PTR(status);
+	}
+
+	/* Try to set up the lease */
+	status = -ENOMEM;
+	dp = alloc_init_deleg(clp, fp, NULL, NFS4_OPEN_DELEGATE_READ);
+	if (!dp)
+		goto out_delegees;
+
+	fl = nfs4_alloc_init_lease(dp);
+	if (!fl)
+		goto out_put_stid;
+
+	status = kernel_setlease(nf->nf_file,
+				 fl->c.flc_type, &fl, NULL);
+	if (fl)
+		locks_free_lease(fl);
+	if (status)
+		goto out_put_stid;
+
+	/*
+	 * Now, try to hash it. This can fail if we race another nfsd task
+	 * trying to set a delegation on the same file. If that happens,
+	 * then just say UNAVAIL.
+	 */
+	spin_lock(&state_lock);
+	spin_lock(&clp->cl_lock);
+	spin_lock(&fp->fi_lock);
+	status = hash_delegation_locked(dp, fp);
+	spin_unlock(&fp->fi_lock);
+	spin_unlock(&clp->cl_lock);
+	spin_unlock(&state_lock);
+
+	if (!status)
+		return dp;
+
+	/* Something failed. Drop the lease and clean up the stid */
+	kernel_setlease(fp->fi_deleg_file->nf_file, F_UNLCK, NULL, (void **)&dp);
+out_put_stid:
+	nfs4_put_stid(&dp->dl_stid);
+out_delegees:
+	put_deleg_file(fp);
+	return ERR_PTR(status);
+}
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 1e736f4024263ffa9c93bcc9ec48f44566a8cc77..b052c1effdc5356487c610db9728df8ecfe851d4 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -867,4 +867,9 @@ static inline bool try_to_expire_client(struct nfs4_client *clp)
 
 extern __be32 nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp,
 		struct dentry *dentry, struct nfs4_delegation **pdp);
+
+struct nfsd4_get_dir_delegation;
+struct nfs4_delegation *nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
+						struct nfsd4_get_dir_delegation *gdd,
+						struct nfsd_file *nf);
 #endif   /* NFSD4_STATE_H */

-- 
2.51.1


^ permalink raw reply related

* [PATCH v6 15/17] nfsd: allow DELEGRETURN on directories
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

As Trond pointed out: "...provided that the presented stateid is
actually valid, it is also sufficient to uniquely identify the file to
which it is associated (see RFC8881 Section 8.2.4), so the filehandle
should be considered mostly irrelevant for operations like DELEGRETURN."

Don't ask fh_verify to filter on file type.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 81fa7cc6c77b3cdc5ff22bc60ab0654f95dc258d..da66798023aba4c36c38208cec7333db237e46e0 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7828,7 +7828,8 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	__be32 status;
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 
-	if ((status = fh_verify(rqstp, &cstate->current_fh, S_IFREG, 0)))
+	status = fh_verify(rqstp, &cstate->current_fh, 0, 0);
+	if (status)
 		return status;
 
 	status = nfsd4_lookup_stateid(cstate, stateid, SC_TYPE_DELEG, SC_STATUS_REVOKED, &s, nn);

-- 
2.51.1


^ permalink raw reply related

* [PATCH v6 14/17] nfsd: allow filecache to hold S_IFDIR files
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

The filecache infrastructure will only handle S_IFREG files at the
moment. Directory delegations will require adding support for opening
S_IFDIR inodes.

Plumb a "type" argument into nfsd_file_do_acquire() and have all of the
existing callers set it to S_IFREG. Add a new nfsd_file_acquire_dir()
wrapper that nfsd can call to request a nfsd_file that holds a directory
open.

For now, there is no need for a fsnotify_mark for directories, as
CB_NOTIFY is not yet supported. Change nfsd_file_do_acquire() to avoid
allocating one for non-S_IFREG inodes.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/filecache.c | 57 ++++++++++++++++++++++++++++++++++++++++-------------
 fs/nfsd/filecache.h |  2 ++
 fs/nfsd/vfs.c       |  5 +++--
 fs/nfsd/vfs.h       |  2 +-
 4 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a238b6725008a5c2988bd3da874d1f34ee778437..93798575b8075c63f95cd415b6d24df706ada0f6 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1086,7 +1086,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 		     struct auth_domain *client,
 		     struct svc_fh *fhp,
 		     unsigned int may_flags, struct file *file,
-		     struct nfsd_file **pnf, bool want_gc)
+		     umode_t type, bool want_gc, struct nfsd_file **pnf)
 {
 	unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
 	struct nfsd_file *new, *nf;
@@ -1097,13 +1097,13 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 	int ret;
 
 retry:
-	if (rqstp) {
-		status = fh_verify(rqstp, fhp, S_IFREG,
+	if (rqstp)
+		status = fh_verify(rqstp, fhp, type,
 				   may_flags|NFSD_MAY_OWNER_OVERRIDE);
-	} else {
-		status = fh_verify_local(net, cred, client, fhp, S_IFREG,
+	else
+		status = fh_verify_local(net, cred, client, fhp, type,
 					 may_flags|NFSD_MAY_OWNER_OVERRIDE);
-	}
+
 	if (status != nfs_ok)
 		return status;
 	inode = d_inode(fhp->fh_dentry);
@@ -1176,15 +1176,18 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 
 open_file:
 	trace_nfsd_file_alloc(nf);
-	nf->nf_mark = nfsd_file_mark_find_or_create(inode);
-	if (nf->nf_mark) {
+
+	if (type == S_IFREG)
+		nf->nf_mark = nfsd_file_mark_find_or_create(inode);
+
+	if (type != S_IFREG || nf->nf_mark) {
 		if (file) {
 			get_file(file);
 			nf->nf_file = file;
 			status = nfs_ok;
 			trace_nfsd_file_opened(nf, status);
 		} else {
-			ret = nfsd_open_verified(fhp, may_flags, &nf->nf_file);
+			ret = nfsd_open_verified(fhp, type, may_flags, &nf->nf_file);
 			if (ret == -EOPENSTALE && stale_retry) {
 				stale_retry = false;
 				nfsd_file_unhash(nf);
@@ -1246,7 +1249,7 @@ nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		     unsigned int may_flags, struct nfsd_file **pnf)
 {
 	return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-				    fhp, may_flags, NULL, pnf, true);
+				    fhp, may_flags, NULL, S_IFREG, true, pnf);
 }
 
 /**
@@ -1271,7 +1274,7 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  unsigned int may_flags, struct nfsd_file **pnf)
 {
 	return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-				    fhp, may_flags, NULL, pnf, false);
+				    fhp, may_flags, NULL, S_IFREG, false, pnf);
 }
 
 /**
@@ -1314,8 +1317,8 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
 	const struct cred *save_cred = get_current_cred();
 	__be32 beres;
 
-	beres = nfsd_file_do_acquire(NULL, net, cred, client,
-				     fhp, may_flags, NULL, pnf, false);
+	beres = nfsd_file_do_acquire(NULL, net, cred, client, fhp, may_flags,
+				     NULL, S_IFREG, false, pnf);
 	put_cred(revert_creds(save_cred));
 	return beres;
 }
@@ -1344,7 +1347,33 @@ nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
 			 struct nfsd_file **pnf)
 {
 	return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-				    fhp, may_flags, file, pnf, false);
+				    fhp, may_flags, file, S_IFREG, false, pnf);
+}
+
+/**
+ * nfsd_file_acquire_dir - Get a struct nfsd_file with an open directory
+ * @rqstp: the RPC transaction being executed
+ * @fhp: the NFS filehandle of the file to be opened
+ * @pnf: OUT: new or found "struct nfsd_file" object
+ *
+ * The nfsd_file_object returned by this API is reference-counted
+ * but not garbage-collected. The object is unhashed after the
+ * final nfsd_file_put(). This opens directories only, and only
+ * in O_RDONLY mode.
+ *
+ * Return values:
+ *   %nfs_ok - @pnf points to an nfsd_file with its reference
+ *   count boosted.
+ *
+ * On error, an nfsstat value in network byte order is returned.
+ */
+__be32
+nfsd_file_acquire_dir(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		      struct nfsd_file **pnf)
+{
+	return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL, fhp,
+				    NFSD_MAY_READ|NFSD_MAY_64BIT_COOKIE,
+				    NULL, S_IFDIR, false, pnf);
 }
 
 /*
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index e3d6ca2b60308e5e91ba4bb32d935f54527d8bda..b383dbc5b9218d21a29b852572f80fab08de9fa9 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -82,5 +82,7 @@ __be32 nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
 __be32 nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
 			       struct auth_domain *client, struct svc_fh *fhp,
 			       unsigned int may_flags, struct nfsd_file **pnf);
+__be32 nfsd_file_acquire_dir(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file **pnf);
 int nfsd_file_cache_stats_show(struct seq_file *m, void *v);
 #endif /* _FS_NFSD_FILECACHE_H */
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 28710da4cce7cc7fc1e14d29420239dc357316f6..8c3ffacc533e9de0d506fb2ec222387446ba8e9f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -959,15 +959,16 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
 /**
  * nfsd_open_verified - Open a regular file for the filecache
  * @fhp: NFS filehandle of the file to open
+ * @type: S_IFMT inode type allowed (0 means any type is allowed)
  * @may_flags: internal permission flags
  * @filp: OUT: open "struct file *"
  *
  * Returns zero on success, or a negative errno value.
  */
 int
-nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
+nfsd_open_verified(struct svc_fh *fhp, umode_t type, int may_flags, struct file **filp)
 {
-	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
+	return __nfsd_open(fhp, type, may_flags, filp);
 }
 
 /*
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6de3daf6f3ed51e2c61c0ad2751de4..09de48c50cbef8e7c4828b38dcb663b529514a30 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -114,7 +114,7 @@ __be32		nfsd_setxattr(struct svc_rqst *rqstp, struct svc_fh *fhp,
 int 		nfsd_open_break_lease(struct inode *, int);
 __be32		nfsd_open(struct svc_rqst *, struct svc_fh *, umode_t,
 				int, struct file **);
-int		nfsd_open_verified(struct svc_fh *fhp, int may_flags,
+int		nfsd_open_verified(struct svc_fh *fhp, umode_t type, int may_flags,
 				struct file **filp);
 __be32		nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct file *file, loff_t offset,

-- 
2.51.1


^ permalink raw reply related

* [PATCH v6 13/17] filelock: lift the ban on directory leases in generic_setlease
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

With the addition of the try_break_lease calls in directory changing
operations, allow generic_setlease to hand them out. Write leases on
directories are never allowed however, so continue to reject them.

For now, there is no API for requesting delegations from userland, so
ensure that userland is prevented from acquiring a lease on a directory.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/locks.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index f5b210a2dc34c70ac36e972436c62482bbe32ca6..dd290a87f58eb5d522f03fa99d612fbad84dacf3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1935,14 +1935,19 @@ static int generic_delete_lease(struct file *filp, void *owner)
 int generic_setlease(struct file *filp, int arg, struct file_lease **flp,
 			void **priv)
 {
-	if (!S_ISREG(file_inode(filp)->i_mode))
+	struct inode *inode = file_inode(filp);
+
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
 		return -EINVAL;
 
 	switch (arg) {
 	case F_UNLCK:
 		return generic_delete_lease(filp, *priv);
-	case F_RDLCK:
 	case F_WRLCK:
+		if (S_ISDIR(inode->i_mode))
+			return -EINVAL;
+		fallthrough;
+	case F_RDLCK:
 		if (!(*flp)->fl_lmops->lm_break) {
 			WARN_ON_ONCE(1);
 			return -ENOLCK;
@@ -2071,6 +2076,9 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, int arg)
  */
 int fcntl_setlease(unsigned int fd, struct file *filp, int arg)
 {
+	if (S_ISDIR(file_inode(filp)->i_mode))
+		return -EINVAL;
+
 	if (arg == F_UNLCK)
 		return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
 	return do_fcntl_add_lease(fd, filp, arg);

-- 
2.51.1


^ permalink raw reply related

* [PATCH v6 12/17] vfs: make vfs_symlink break delegations on parent dir
From: Jeff Layton @ 2025-11-11 14:12 UTC (permalink / raw)
  To: Miklos Szeredi, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Trond Myklebust, Anna Schumaker,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Tom Talpey, Bharath SM, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, David Howells, Tyler Hicks, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Amir Goldstein, Namjae Jeon,
	Steve French, Sergey Senozhatsky, Carlos Maiolino,
	Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-cifs,
	samba-technical, netfs, ecryptfs, linux-unionfs, linux-xfs,
	netdev, linux-api, Jeff Layton
In-Reply-To: <20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org>

In order to add directory delegation support, we must break delegations
on the parent on any change to the directory.

Add a delegated_inode parameter to vfs_symlink() and have it break the
delegation. do_symlinkat() can then wait on the delegation break before
proceeding.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ecryptfs/inode.c      |  2 +-
 fs/init.c                |  2 +-
 fs/namei.c               | 16 ++++++++++++++--
 fs/nfsd/vfs.c            |  2 +-
 fs/overlayfs/overlayfs.h |  2 +-
 include/linux/fs.h       |  2 +-
 6 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 83f06452476dec4025cc8f50e082ae6ccbbc7914..ba15e7359dfa6e150b577205991010873a633511 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -479,7 +479,7 @@ static int ecryptfs_symlink(struct mnt_idmap *idmap,
 	if (rc)
 		goto out_lock;
 	rc = vfs_symlink(&nop_mnt_idmap, lower_dir, lower_dentry,
-			 encoded_symname);
+			 encoded_symname, NULL);
 	kfree(encoded_symname);
 	if (rc || d_really_is_negative(lower_dentry))
 		goto out_lock;
diff --git a/fs/init.c b/fs/init.c
index 4f02260dd65b0dfcbfbf5812d2ec6a33444a3b1f..e0f5429c0a49d046bd3f231a260954ed0f90ef44 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -209,7 +209,7 @@ int __init init_symlink(const char *oldname, const char *newname)
 	error = security_path_symlink(&path, dentry, oldname);
 	if (!error)
 		error = vfs_symlink(mnt_idmap(path.mnt), path.dentry->d_inode,
-				    dentry, oldname);
+				    dentry, oldname, NULL);
 	end_creating_path(&path, dentry);
 	return error;
 }
diff --git a/fs/namei.c b/fs/namei.c
index e9616134390fb7f0d09a759be69bf677f8800bc5..d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4845,6 +4845,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
  * @dir:	inode of the parent directory
  * @dentry:	dentry of the child symlink file
  * @oldname:	name of the file to link to
+ * @delegated_inode: returns victim inode, if the inode is delegated.
  *
  * Create a symlink.
  *
@@ -4855,7 +4856,8 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
  * raw inode simply pass @nop_mnt_idmap.
  */
 int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
-		struct dentry *dentry, const char *oldname)
+		struct dentry *dentry, const char *oldname,
+		struct delegated_inode *delegated_inode)
 {
 	int error;
 
@@ -4870,6 +4872,10 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
+	error = try_break_deleg(dir, delegated_inode);
+	if (error)
+		return error;
+
 	error = dir->i_op->symlink(idmap, dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
@@ -4883,6 +4889,7 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	struct delegated_inode delegated_inode = { };
 
 	if (IS_ERR(from)) {
 		error = PTR_ERR(from);
@@ -4897,8 +4904,13 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
 	error = security_path_symlink(&path, dentry, from->name);
 	if (!error)
 		error = vfs_symlink(mnt_idmap(path.mnt), path.dentry->d_inode,
-				    dentry, from->name);
+				    dentry, from->name, &delegated_inode);
 	end_creating_path(&path, dentry);
+	if (is_delegated(&delegated_inode)) {
+		error = break_deleg_wait(&delegated_inode);
+		if (!error)
+			goto retry;
+	}
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6684935007b17ed40723ff0a744045fd187ace5e..28710da4cce7cc7fc1e14d29420239dc357316f6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1742,7 +1742,7 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = fh_fill_pre_attrs(fhp);
 	if (err != nfs_ok)
 		goto out_unlock;
-	host_err = vfs_symlink(&nop_mnt_idmap, d_inode(dentry), dnew, path);
+	host_err = vfs_symlink(&nop_mnt_idmap, d_inode(dentry), dnew, path, NULL);
 	err = nfserrno(host_err);
 	cerr = fh_compose(resfhp, fhp->fh_export, dnew, fhp);
 	if (!err)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index afd95721f76ea761cfe7133c3902028550f3e359..5065961bd370628c9afe914ea570d9998a096660 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -267,7 +267,7 @@ static inline int ovl_do_symlink(struct ovl_fs *ofs,
 				 struct inode *dir, struct dentry *dentry,
 				 const char *oldname)
 {
-	int err = vfs_symlink(ovl_upper_mnt_idmap(ofs), dir, dentry, oldname);
+	int err = vfs_symlink(ovl_upper_mnt_idmap(ofs), dir, dentry, oldname, NULL);
 
 	pr_debug("symlink(\"%s\", %pd2) = %i\n", oldname, dentry, err);
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a5d86cfafaa97fc89d15cd1a156968b8c3dd377..64323e618724bc20dc101db13035b042f5f88e4d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2118,7 +2118,7 @@ struct dentry *vfs_mkdir(struct mnt_idmap *, struct inode *,
 int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
 	      umode_t, dev_t, struct delegated_inode *);
 int vfs_symlink(struct mnt_idmap *, struct inode *,
-		struct dentry *, const char *);
+		struct dentry *, const char *, struct delegated_inode *);
 int vfs_link(struct dentry *, struct mnt_idmap *, struct inode *,
 	     struct dentry *, struct delegated_inode *);
 int vfs_rmdir(struct mnt_idmap *, struct inode *, struct dentry *,

-- 
2.51.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox