Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-06 17:01 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-4-pasha.tatashin@soleen.com>

On Mon, Sep 29 2025, Pasha Tatashin wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> The KHO framework uses a notifier chain as the mechanism for clients to
> participate in the finalization process. While this works for a single,
> central state machine, it is too restrictive for kernel-internal
> components like pstore/reserve_mem or IMA. These components need a
> simpler, direct way to register their state for preservation (e.g.,
> during their initcall) without being part of a complex,
> shutdown-time notifier sequence. The notifier model forces all
> participants into a single finalization flow and makes direct
> preservation from an arbitrary context difficult.
> This patch refactors the client participation model by removing the
> notifier chain and introducing a direct API for managing FDT subtrees.
>
> The core kho_finalize() and kho_abort() state machine remains, but
> clients now register their data with KHO beforehand.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e23e16618e9b..c4b2d4e4c715 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
>  #define MEMBLOCK_KHO_FDT "memblock"
>  #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
>  #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> -static struct page *kho_fdt;
> -
> -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> -{
> -	int err = 0, i;
> -
> -	for (i = 0; i < reserved_mem_count; i++) {
> -		struct reserve_mem_table *map = &reserved_mem_table[i];
> -		struct page *page = phys_to_page(map->start);
> -		unsigned int nr_pages = map->size >> PAGE_SHIFT;
> -
> -		err |= kho_preserve_pages(page, nr_pages);
> -	}
> -
> -	err |= kho_preserve_folio(page_folio(kho_fdt));
> -	err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> -
> -	return notifier_from_errno(err);
> -}
> -
> -static int reserve_mem_kho_notifier(struct notifier_block *self,
> -				    unsigned long cmd, void *v)
> -{
> -	switch (cmd) {
> -	case KEXEC_KHO_FINALIZE:
> -		return reserve_mem_kho_finalize((struct kho_serialization *)v);
> -	case KEXEC_KHO_ABORT:
> -		return NOTIFY_DONE;
> -	default:
> -		return NOTIFY_BAD;
> -	}
> -}
> -
> -static struct notifier_block reserve_mem_kho_nb = {
> -	.notifier_call = reserve_mem_kho_notifier,
> -};
>  
>  static int __init prepare_kho_fdt(void)
>  {
>  	int err = 0, i;
> +	struct page *fdt_page;
>  	void *fdt;
>  
> -	kho_fdt = alloc_page(GFP_KERNEL);
> -	if (!kho_fdt)
> +	fdt_page = alloc_page(GFP_KERNEL);
> +	if (!fdt_page)
>  		return -ENOMEM;
>  
> -	fdt = page_to_virt(kho_fdt);
> +	fdt = page_to_virt(fdt_page);
>  
>  	err |= fdt_create(fdt, PAGE_SIZE);
>  	err |= fdt_finish_reservemap(fdt);
> @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
>  	err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
>  	for (i = 0; i < reserved_mem_count; i++) {
>  		struct reserve_mem_table *map = &reserved_mem_table[i];
> +		struct page *page = phys_to_page(map->start);
> +		unsigned int nr_pages = map->size >> PAGE_SHIFT;
>  
> +		err |= kho_preserve_pages(page, nr_pages);
>  		err |= fdt_begin_node(fdt, map->name);
>  		err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
>  		err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
>  		err |= fdt_end_node(fdt);
>  	}
>  	err |= fdt_end_node(fdt);
> -
>  	err |= fdt_finish(fdt);
>  
> +	err |= kho_preserve_folio(page_folio(fdt_page));
> +	err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> +
>  	if (err) {
>  		pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> -		put_page(kho_fdt);
> -		kho_fdt = NULL;
> +		put_page(fdt_page);

This adds subtree to KHO even if the FDT might be invalid. And then
leaves a dangling reference in KHO to the FDT in case of an error. I
think you should either do this check after
kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
kho_add_subtree(), or call kho_remove_subtree() in the error block.

I prefer the former since if kho_add_subtree() is the one that fails,
there is little sense in removing a subtree that was never added.

>  	}
>  
>  	return err;
> @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
>  	err = prepare_kho_fdt();
>  	if (err)
>  		return err;
> -
> -	err = register_kho_notifier(&reserve_mem_kho_nb);
> -	if (err) {
> -		put_page(kho_fdt);
> -		kho_fdt = NULL;
> -	}
> -
>  	return err;
>  }
>  late_initcall(reserve_mem_init);

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-10-06 17:21 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0tt0cnevi.fsf@kernel.org>

On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > The KHO framework uses a notifier chain as the mechanism for clients to
> > participate in the finalization process. While this works for a single,
> > central state machine, it is too restrictive for kernel-internal
> > components like pstore/reserve_mem or IMA. These components need a
> > simpler, direct way to register their state for preservation (e.g.,
> > during their initcall) without being part of a complex,
> > shutdown-time notifier sequence. The notifier model forces all
> > participants into a single finalization flow and makes direct
> > preservation from an arbitrary context difficult.
> > This patch refactors the client participation model by removing the
> > notifier chain and introducing a direct API for managing FDT subtrees.
> >
> > The core kho_finalize() and kho_abort() state machine remains, but
> > clients now register their data with KHO beforehand.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index e23e16618e9b..c4b2d4e4c715 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
> >  #define MEMBLOCK_KHO_FDT "memblock"
> >  #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
> >  #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> > -static struct page *kho_fdt;
> > -
> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> > -{
> > -     int err = 0, i;
> > -
> > -     for (i = 0; i < reserved_mem_count; i++) {
> > -             struct reserve_mem_table *map = &reserved_mem_table[i];
> > -             struct page *page = phys_to_page(map->start);
> > -             unsigned int nr_pages = map->size >> PAGE_SHIFT;
> > -
> > -             err |= kho_preserve_pages(page, nr_pages);
> > -     }
> > -
> > -     err |= kho_preserve_folio(page_folio(kho_fdt));
> > -     err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> > -
> > -     return notifier_from_errno(err);
> > -}
> > -
> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
> > -                                 unsigned long cmd, void *v)
> > -{
> > -     switch (cmd) {
> > -     case KEXEC_KHO_FINALIZE:
> > -             return reserve_mem_kho_finalize((struct kho_serialization *)v);
> > -     case KEXEC_KHO_ABORT:
> > -             return NOTIFY_DONE;
> > -     default:
> > -             return NOTIFY_BAD;
> > -     }
> > -}
> > -
> > -static struct notifier_block reserve_mem_kho_nb = {
> > -     .notifier_call = reserve_mem_kho_notifier,
> > -};
> >
> >  static int __init prepare_kho_fdt(void)
> >  {
> >       int err = 0, i;
> > +     struct page *fdt_page;
> >       void *fdt;
> >
> > -     kho_fdt = alloc_page(GFP_KERNEL);
> > -     if (!kho_fdt)
> > +     fdt_page = alloc_page(GFP_KERNEL);
> > +     if (!fdt_page)
> >               return -ENOMEM;
> >
> > -     fdt = page_to_virt(kho_fdt);
> > +     fdt = page_to_virt(fdt_page);
> >
> >       err |= fdt_create(fdt, PAGE_SIZE);
> >       err |= fdt_finish_reservemap(fdt);
> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
> >       err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
> >       for (i = 0; i < reserved_mem_count; i++) {
> >               struct reserve_mem_table *map = &reserved_mem_table[i];
> > +             struct page *page = phys_to_page(map->start);
> > +             unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >
> > +             err |= kho_preserve_pages(page, nr_pages);
> >               err |= fdt_begin_node(fdt, map->name);
> >               err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
> >               err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
> >               err |= fdt_end_node(fdt);
> >       }
> >       err |= fdt_end_node(fdt);
> > -
> >       err |= fdt_finish(fdt);
> >
> > +     err |= kho_preserve_folio(page_folio(fdt_page));
> > +     err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> > +
> >       if (err) {
> >               pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> > -             put_page(kho_fdt);
> > -             kho_fdt = NULL;
> > +             put_page(fdt_page);
>
> This adds subtree to KHO even if the FDT might be invalid. And then
> leaves a dangling reference in KHO to the FDT in case of an error. I
> think you should either do this check after
> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
> kho_add_subtree(), or call kho_remove_subtree() in the error block.

I agree, I do not like these err |= stuff, we should be checking
errors cleanly, and do proper clean-ups.

> I prefer the former since if kho_add_subtree() is the one that fails,
> there is little sense in removing a subtree that was never added.
>
> >       }
> >
> >       return err;
> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
> >       err = prepare_kho_fdt();
> >       if (err)
> >               return err;
> > -
> > -     err = register_kho_notifier(&reserve_mem_kho_nb);
> > -     if (err) {
> > -             put_page(kho_fdt);
> > -             kho_fdt = NULL;
> > -     }
> > -
> >       return err;
> >  }
> >  late_initcall(reserve_mem_init);
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-10-06 17:23 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0y0ponf6b.fsf@kernel.org>

On Mon, Oct 6, 2025 at 12:55 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Currently, KHO is controlled via debugfs interface, but once LUO is
> > introduced, it can control KHO, and the debug interface becomes
> > optional.
> >
> > Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> > the debugfs interface, and allows to inspect the tree.
> >
> > Move all debugfs related code to a new file to keep the .c files
> > clear of ifdefs.
> >
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > @@ -662,36 +660,24 @@ static void __init kho_reserve_scratch(void)
> >       kho_enable = false;
> >  }
> >
> > -struct fdt_debugfs {
> > -     struct list_head list;
> > -     struct debugfs_blob_wrapper wrapper;
> > -     struct dentry *file;
> > +struct kho_out {
> > +     struct blocking_notifier_head chain_head;
> > +     struct mutex lock; /* protects KHO FDT finalization */
> > +     struct kho_serialization ser;
> > +     bool finalized;
> > +     struct kho_debugfs dbg;
> >  };
> >
> > -static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
> > -                            const char *name, const void *fdt)
> > -{
> > -     struct fdt_debugfs *f;
> > -     struct dentry *file;
> > -
> > -     f = kmalloc(sizeof(*f), GFP_KERNEL);
> > -     if (!f)
> > -             return -ENOMEM;
> > -
> > -     f->wrapper.data = (void *)fdt;
> > -     f->wrapper.size = fdt_totalsize(fdt);
> > -
> > -     file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
> > -     if (IS_ERR(file)) {
> > -             kfree(f);
> > -             return PTR_ERR(file);
> > -     }
> > -
> > -     f->file = file;
> > -     list_add(&f->list, list);
> > -
> > -     return 0;
> > -}
> > +static struct kho_out kho_out = {
> > +     .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
> > +     .lock = __MUTEX_INITIALIZER(kho_out.lock),
> > +     .ser = {
> > +             .track = {
> > +                     .orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
> > +             },
> > +     },
> > +     .finalized = false,
> > +};
>
> There is already one definition for struct kho_out and a static struct
> kho_out early in the file. This is a second declaration and definition.
> And I was super confused when I saw patch 3 since it seemed to be making
> unrelated changes to this struct (and removing an instance of this,
> which should be done in this patch instead). In fact, this patch doesn't
> even build due to this problem. I think some patch massaging is needed
> to fix this all up.

Let me fix it. I Plan to send a separate series only with KHO changes
from LUO, so we can expedite its landing.

Pasha

>
> >
> >  /**
> >   * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
> [...]
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-10-06 18:02 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs07bx8ouva.fsf@kernel.org>

On Mon, Oct 6, 2025 at 12:31 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Currently, KHO is controlled via debugfs interface, but once LUO is
> > introduced, it can control KHO, and the debug interface becomes
> > optional.
> >
> > Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> > the debugfs interface, and allows to inspect the tree.
> >
> > Move all debugfs related code to a new file to keep the .c files
> > clear of ifdefs.
> >
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  MAINTAINERS                      |   3 +-
> >  kernel/Kconfig.kexec             |  10 ++
> >  kernel/Makefile                  |   1 +
> >  kernel/kexec_handover.c          | 255 +++++--------------------------
> >  kernel/kexec_handover_debug.c    | 218 ++++++++++++++++++++++++++
> >  kernel/kexec_handover_internal.h |  44 ++++++
> >  6 files changed, 311 insertions(+), 220 deletions(-)
> >  create mode 100644 kernel/kexec_handover_debug.c
> >  create mode 100644 kernel/kexec_handover_internal.h
> >
> [...]
> > --- a/kernel/Kconfig.kexec
> > +++ b/kernel/Kconfig.kexec
> > @@ -109,6 +109,16 @@ config KEXEC_HANDOVER
> >         to keep data or state alive across the kexec. For this to work,
> >         both source and target kernels need to have this option enabled.
> >
> > +config KEXEC_HANDOVER_DEBUG
>
> Nit: can we call it KEXEC_HANDOVER_DEBUGFS instead? I think we would
> like to add a KEXEC_HANDOVER_DEBUG at some point to control debug
> asserts for KHO, and the naming would get confusing. And renaming config
> symbols is kind of a pain.

Done.

>
> > +     bool "kexec handover debug interface"
> > +     depends on KEXEC_HANDOVER
> > +     depends on DEBUG_FS
> > +     help
> > +       Allow to control kexec handover device tree via debugfs
> > +       interface, i.e. finalize the state or aborting the finalization.
> > +       Also, enables inspecting the KHO fdt trees with the debugfs binary
> > +       blobs.
> > +
> [...]
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Andrey Albershteyn @ 2025-10-06 18:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jiri Slaby, Amir Goldstein, Arnd Bergmann, Casey Schaufler,
	Christian Brauner, Pali Rohár, Paul Moore, linux-api,
	linux-fsdevel, linux-kernel, linux-xfs, selinux,
	Andrey Albershteyn
In-Reply-To: <jp3vopwtpik7bj77aejuknaziecuml6x2l2dr3oe2xoats6tls@yskzvehakmkv>

On 2025-10-06 17:39:46, Jan Kara wrote:
> On Mon 06-10-25 13:09:05, Jiri Slaby wrote:
> > On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> > > Future patches will add new syscalls which use these functions. As
> > > this interface won't be used for ioctls only, the EOPNOSUPP is more
> > > appropriate return code.
> > > 
> > > This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> > > vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> > > EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > ...
> > > @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> > >   			fileattr_fill_flags(&fa, flags);
> > >   			err = vfs_fileattr_set(idmap, dentry, &fa);
> > >   			mnt_drop_write_file(file);
> > > +			if (err == -EOPNOTSUPP)
> > > +				err = -ENOIOCTLCMD;
> > 
> > This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
> > ENOIOCTLCMD/ENOTTY:
> > https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
> > 
> > I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used to
> > return EOPNOTSUPP.
> > 
> > This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
> > &FS_NODUMP_FL):
> > https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
> > 
> > dumps in 6.16:
> > sf: ioctl: Operation not supported
> > 
> > with the above patch:
> > sf: ioctl: Inappropriate ioctl for device
> > 
> > Is this expected?
> 
> No, that's a bug and a clear userspace regression so we need to fix it. I
> think we need to revert this commit and instead convert ENOIOCTLCMD from
> vfs_fileattr_get/set() to EOPNOTSUPP in appropriate places. Andrey?

I will prepare a patch soon

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

-- 
- Andrey


^ permalink raw reply

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Christoph Hellwig @ 2025-10-07  5:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <CALCETrVsD6Z42gO7S-oAbweN5OwV1OLqxztBkB58goSzccSZKw@mail.gmail.com>

On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > Well, we'll need to look into that, including maybe non-blockin
> > timestamp updates.
> >
> 
> It's been 12 years (!), but maybe it's time to reconsider this:
> 
> https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/

I don't see how that is relevant here.  Also writes through shared
mmaps are problematic for so many reasons that I'm not sure we want
to encourage people to use that more.


^ permalink raw reply

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Christoph Hellwig @ 2025-10-07  5:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <aOLr8M6s1W2qC5-Q@dread.disaster.area>

On Mon, Oct 06, 2025 at 09:06:40AM +1100, Dave Chinner wrote:
> If you don't care about accurate c/mtime, then mount the filesystem
> with '-o lazytime' to degrade c/mtime updates to "eventual
> consistency" behaviour for IO operations.

Exactly.

> Lazytime updates can generally be done in a non-blocking manner
> right now (someone raised that in the context of io-uring on #xfs
> about a month ago), but the NOWAIT behaviour for timestamp updates
> is done at a higher level in the VFS and does not take into account
> filesystem specific non-blocking lazytime updates at all.  If we
> push the NOWAIT checking behaviour down to the filesystem, we can do
> this.

We might not even have to push it out, but just make the VFS/rw helper
check aware of lazytime.  Either way currently even a lazytime
timestampt update will cause a write to block, which renders the
nowait writes pretty useless on anything but block devices.


^ permalink raw reply

* Re: [PATCH 1/7] tools/nolibc: remove __nolibc_enosys() fallback from time64-related functions
From: Arnd Bergmann @ 2025-10-07 10:03 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Willy Tarreau, Catalin Marinas, Will Deacon, shuah, Mark Brown,
	linux-kernel, linux-arm-kernel, linux-kselftest, Linux-Arch,
	linux-api
In-Reply-To: <a5b8344e-8214-4946-8344-f34e969d30b2@t-8ch.de>

On Mon, Oct 6, 2025, at 22:14, Thomas Weißschuh wrote:
> On 2025-10-01 09:43:37+0200, Arnd Bergmann wrote:
>> On Thu, Aug 21, 2025, at 17:40, Thomas Weißschuh wrote:
>>  - the old types are often too short, both for the y2038
>>    overflow and for the file system types.
>
> So far this was not something we actively tried to support,
> especially with the restriction mentioned below.
>
>> I suspect the problem is that the kernel's uapi/linux/time.h
>> still defines the old types as the default, and nolibc
>> historically just picks it up from there.
>
> So far we have tried to keep nolibc compatible with the kernel UAPI when
> included in any order. This forced us to use 'struct timespec' from
> uapi/linux/time.h. With the upcoming implementation of signals in nolibc
> this guideline is relaxed a bit, so we should be able to use our own
> always-64-bit 'struct timespec'.

You can probably either "#define timespec __kernel_timespec" or
"#define __kernel_timespec timespec" before including
linux/time_types.h.

Note that there is no time64 variant of "struct timeval", so
any syscall that needs this has to be implemented in userspace
as a wrapper around the timespec based one, e.g. gettimeofday()
needs to call clock_gettime() on all 32-bit systems.

>> We should also consider drop the
>> legacy type definitions from uapi/linux/time.h and
>> require each libc to define their own.
>
> Can we even just drop them? Or should they also get some backwards
> compat guards?

This is the big question, and we kind of left this one open
to be decided later when we finished the actual binary interface.

I think simply dropping the old definition is one of several
options we have, because that does not change the ABI in an
incompatible way and just requires the few user space sources
that use this to either require old kernel headers or make
simple source-level changes that they should have done for
portability anyway.

I see multiple decisions we have to make in that option space
once we decide to do anything:

- do we change the headers for both 32-bit and 64-bit userspace
  for consistency, or only for 32-bit userspace to limit
  the impact to those users that care about 32-bit?

- do we remove only the type definitions (timespec, timeval,
  itimerspec, itimerval, timex) or also the syscall macros
  for the time32 syscalls using them?

- What method (if any) would be used to choose between the
  time32 definitions, the time64 definitions or none of them
  when including the kernel headers?

      Arnd

^ permalink raw reply

* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Christian Brauner @ 2025-10-07 11:00 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Jan Kara, Jiri Slaby, Amir Goldstein, Arnd Bergmann,
	Casey Schaufler, Pali Rohár, Paul Moore, linux-api,
	linux-fsdevel, linux-kernel, linux-xfs, selinux,
	Andrey Albershteyn
In-Reply-To: <eyl6bzyi33tn6uys2ba5xjluvw7yjempqnla3jaih76mtgxgxq@i6xe2nquwqaf>

On Mon, Oct 06, 2025 at 08:52:32PM +0200, Andrey Albershteyn wrote:
> On 2025-10-06 17:39:46, Jan Kara wrote:
> > On Mon 06-10-25 13:09:05, Jiri Slaby wrote:
> > > On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> > > > Future patches will add new syscalls which use these functions. As
> > > > this interface won't be used for ioctls only, the EOPNOSUPP is more
> > > > appropriate return code.
> > > > 
> > > > This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> > > > vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> > > > EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > > ...
> > > > @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> > > >   			fileattr_fill_flags(&fa, flags);
> > > >   			err = vfs_fileattr_set(idmap, dentry, &fa);
> > > >   			mnt_drop_write_file(file);
> > > > +			if (err == -EOPNOTSUPP)
> > > > +				err = -ENOIOCTLCMD;
> > > 
> > > This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
> > > ENOIOCTLCMD/ENOTTY:
> > > https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
> > > 
> > > I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used to
> > > return EOPNOTSUPP.
> > > 
> > > This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
> > > &FS_NODUMP_FL):
> > > https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
> > > 
> > > dumps in 6.16:
> > > sf: ioctl: Operation not supported
> > > 
> > > with the above patch:
> > > sf: ioctl: Inappropriate ioctl for device
> > > 
> > > Is this expected?

Nope, unintentional regression as Arnd noted.

> > 
> > No, that's a bug and a clear userspace regression so we need to fix it. I
> > think we need to revert this commit and instead convert ENOIOCTLCMD from
> > vfs_fileattr_get/set() to EOPNOTSUPP in appropriate places. Andrey?
> 
> I will prepare a patch soon

Thanks!

^ permalink raw reply

* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-07 12:09 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bA2qfLF1Mbyvnat+L9+5KAw6LnhYETXVoYcMGJxwTGahg@mail.gmail.com>

On Mon, Oct 06 2025, Pasha Tatashin wrote:

> On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>>
>> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> > The KHO framework uses a notifier chain as the mechanism for clients to
>> > participate in the finalization process. While this works for a single,
>> > central state machine, it is too restrictive for kernel-internal
>> > components like pstore/reserve_mem or IMA. These components need a
>> > simpler, direct way to register their state for preservation (e.g.,
>> > during their initcall) without being part of a complex,
>> > shutdown-time notifier sequence. The notifier model forces all
>> > participants into a single finalization flow and makes direct
>> > preservation from an arbitrary context difficult.
>> > This patch refactors the client participation model by removing the
>> > notifier chain and introducing a direct API for managing FDT subtrees.
>> >
>> > The core kho_finalize() and kho_abort() state machine remains, but
>> > clients now register their data with KHO beforehand.
>> >
>> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> [...]
>> > diff --git a/mm/memblock.c b/mm/memblock.c
>> > index e23e16618e9b..c4b2d4e4c715 100644
>> > --- a/mm/memblock.c
>> > +++ b/mm/memblock.c
>> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
>> >  #define MEMBLOCK_KHO_FDT "memblock"
>> >  #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
>> >  #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
>> > -static struct page *kho_fdt;
>> > -
>> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
>> > -{
>> > -     int err = 0, i;
>> > -
>> > -     for (i = 0; i < reserved_mem_count; i++) {
>> > -             struct reserve_mem_table *map = &reserved_mem_table[i];
>> > -             struct page *page = phys_to_page(map->start);
>> > -             unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> > -
>> > -             err |= kho_preserve_pages(page, nr_pages);
>> > -     }
>> > -
>> > -     err |= kho_preserve_folio(page_folio(kho_fdt));
>> > -     err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
>> > -
>> > -     return notifier_from_errno(err);
>> > -}
>> > -
>> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
>> > -                                 unsigned long cmd, void *v)
>> > -{
>> > -     switch (cmd) {
>> > -     case KEXEC_KHO_FINALIZE:
>> > -             return reserve_mem_kho_finalize((struct kho_serialization *)v);
>> > -     case KEXEC_KHO_ABORT:
>> > -             return NOTIFY_DONE;
>> > -     default:
>> > -             return NOTIFY_BAD;
>> > -     }
>> > -}
>> > -
>> > -static struct notifier_block reserve_mem_kho_nb = {
>> > -     .notifier_call = reserve_mem_kho_notifier,
>> > -};
>> >
>> >  static int __init prepare_kho_fdt(void)
>> >  {
>> >       int err = 0, i;
>> > +     struct page *fdt_page;
>> >       void *fdt;
>> >
>> > -     kho_fdt = alloc_page(GFP_KERNEL);
>> > -     if (!kho_fdt)
>> > +     fdt_page = alloc_page(GFP_KERNEL);
>> > +     if (!fdt_page)
>> >               return -ENOMEM;
>> >
>> > -     fdt = page_to_virt(kho_fdt);
>> > +     fdt = page_to_virt(fdt_page);
>> >
>> >       err |= fdt_create(fdt, PAGE_SIZE);
>> >       err |= fdt_finish_reservemap(fdt);
>> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
>> >       err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
>> >       for (i = 0; i < reserved_mem_count; i++) {
>> >               struct reserve_mem_table *map = &reserved_mem_table[i];
>> > +             struct page *page = phys_to_page(map->start);
>> > +             unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >
>> > +             err |= kho_preserve_pages(page, nr_pages);
>> >               err |= fdt_begin_node(fdt, map->name);
>> >               err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
>> >               err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
>> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
>> >               err |= fdt_end_node(fdt);
>> >       }
>> >       err |= fdt_end_node(fdt);
>> > -
>> >       err |= fdt_finish(fdt);
>> >
>> > +     err |= kho_preserve_folio(page_folio(fdt_page));
>> > +     err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
>> > +
>> >       if (err) {
>> >               pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
>> > -             put_page(kho_fdt);
>> > -             kho_fdt = NULL;
>> > +             put_page(fdt_page);
>>
>> This adds subtree to KHO even if the FDT might be invalid. And then
>> leaves a dangling reference in KHO to the FDT in case of an error. I
>> think you should either do this check after
>> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
>> kho_add_subtree(), or call kho_remove_subtree() in the error block.
>
> I agree, I do not like these err |= stuff, we should be checking
> errors cleanly, and do proper clean-ups.

Yeah, this is mainly a byproduct of using FDTs. Getting and setting
simple properties also needs error checking and that can get tedious
real quick. Which is why this pattern has shown up I suppose.

>
>> I prefer the former since if kho_add_subtree() is the one that fails,
>> there is little sense in removing a subtree that was never added.
>>
>> >       }
>> >
>> >       return err;
>> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
>> >       err = prepare_kho_fdt();
>> >       if (err)
>> >               return err;
>> > -
>> > -     err = register_kho_notifier(&reserve_mem_kho_nb);
>> > -     if (err) {
>> > -             put_page(kho_fdt);
>> > -             kho_fdt = NULL;
>> > -     }
>> > -
>> >       return err;
>> >  }
>> >  late_initcall(reserve_mem_init);
>>
>> --
>> Regards,
>> Pratyush Yadav

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-10-07 13:16 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0playoqui.fsf@kernel.org>

On Tue, Oct 7, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Oct 06 2025, Pasha Tatashin wrote:
>
> > On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >>
> >> On Mon, Sep 29 2025, Pasha Tatashin wrote:
> >>
> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> > The KHO framework uses a notifier chain as the mechanism for clients to
> >> > participate in the finalization process. While this works for a single,
> >> > central state machine, it is too restrictive for kernel-internal
> >> > components like pstore/reserve_mem or IMA. These components need a
> >> > simpler, direct way to register their state for preservation (e.g.,
> >> > during their initcall) without being part of a complex,
> >> > shutdown-time notifier sequence. The notifier model forces all
> >> > participants into a single finalization flow and makes direct
> >> > preservation from an arbitrary context difficult.
> >> > This patch refactors the client participation model by removing the
> >> > notifier chain and introducing a direct API for managing FDT subtrees.
> >> >
> >> > The core kho_finalize() and kho_abort() state machine remains, but
> >> > clients now register their data with KHO beforehand.
> >> >
> >> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >> [...]
> >> > diff --git a/mm/memblock.c b/mm/memblock.c
> >> > index e23e16618e9b..c4b2d4e4c715 100644
> >> > --- a/mm/memblock.c
> >> > +++ b/mm/memblock.c
> >> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
> >> >  #define MEMBLOCK_KHO_FDT "memblock"
> >> >  #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
> >> >  #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> >> > -static struct page *kho_fdt;
> >> > -
> >> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> >> > -{
> >> > -     int err = 0, i;
> >> > -
> >> > -     for (i = 0; i < reserved_mem_count; i++) {
> >> > -             struct reserve_mem_table *map = &reserved_mem_table[i];
> >> > -             struct page *page = phys_to_page(map->start);
> >> > -             unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >> > -
> >> > -             err |= kho_preserve_pages(page, nr_pages);
> >> > -     }
> >> > -
> >> > -     err |= kho_preserve_folio(page_folio(kho_fdt));
> >> > -     err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> >> > -
> >> > -     return notifier_from_errno(err);
> >> > -}
> >> > -
> >> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
> >> > -                                 unsigned long cmd, void *v)
> >> > -{
> >> > -     switch (cmd) {
> >> > -     case KEXEC_KHO_FINALIZE:
> >> > -             return reserve_mem_kho_finalize((struct kho_serialization *)v);
> >> > -     case KEXEC_KHO_ABORT:
> >> > -             return NOTIFY_DONE;
> >> > -     default:
> >> > -             return NOTIFY_BAD;
> >> > -     }
> >> > -}
> >> > -
> >> > -static struct notifier_block reserve_mem_kho_nb = {
> >> > -     .notifier_call = reserve_mem_kho_notifier,
> >> > -};
> >> >
> >> >  static int __init prepare_kho_fdt(void)
> >> >  {
> >> >       int err = 0, i;
> >> > +     struct page *fdt_page;
> >> >       void *fdt;
> >> >
> >> > -     kho_fdt = alloc_page(GFP_KERNEL);
> >> > -     if (!kho_fdt)
> >> > +     fdt_page = alloc_page(GFP_KERNEL);
> >> > +     if (!fdt_page)
> >> >               return -ENOMEM;
> >> >
> >> > -     fdt = page_to_virt(kho_fdt);
> >> > +     fdt = page_to_virt(fdt_page);
> >> >
> >> >       err |= fdt_create(fdt, PAGE_SIZE);
> >> >       err |= fdt_finish_reservemap(fdt);
> >> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
> >> >       err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
> >> >       for (i = 0; i < reserved_mem_count; i++) {
> >> >               struct reserve_mem_table *map = &reserved_mem_table[i];
> >> > +             struct page *page = phys_to_page(map->start);
> >> > +             unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >> >
> >> > +             err |= kho_preserve_pages(page, nr_pages);
> >> >               err |= fdt_begin_node(fdt, map->name);
> >> >               err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
> >> >               err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> >> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
> >> >               err |= fdt_end_node(fdt);
> >> >       }
> >> >       err |= fdt_end_node(fdt);
> >> > -
> >> >       err |= fdt_finish(fdt);
> >> >
> >> > +     err |= kho_preserve_folio(page_folio(fdt_page));
> >> > +     err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> >> > +
> >> >       if (err) {
> >> >               pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> >> > -             put_page(kho_fdt);
> >> > -             kho_fdt = NULL;
> >> > +             put_page(fdt_page);
> >>
> >> This adds subtree to KHO even if the FDT might be invalid. And then
> >> leaves a dangling reference in KHO to the FDT in case of an error. I
> >> think you should either do this check after
> >> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
> >> kho_add_subtree(), or call kho_remove_subtree() in the error block.
> >
> > I agree, I do not like these err |= stuff, we should be checking
> > errors cleanly, and do proper clean-ups.
>
> Yeah, this is mainly a byproduct of using FDTs. Getting and setting
> simple properties also needs error checking and that can get tedious
> real quick. Which is why this pattern has shown up I suppose.

Exactly. This is also why it's important to replace FDT with something
more sensible for general-purpose live update purposes.

By the way, I forgot to address this comment in the v5 of the KHO
series I sent out yesterday. Could you please take another look? If
everything else is good, I will refresh that series so we can ask
Andrew to take in the KHO patches. That would simplify the LUO series.

Pasha

>
> >
> >> I prefer the former since if kho_add_subtree() is the one that fails,
> >> there is little sense in removing a subtree that was never added.
> >>
> >> >       }
> >> >
> >> >       return err;
> >> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
> >> >       err = prepare_kho_fdt();
> >> >       if (err)
> >> >               return err;
> >> > -
> >> > -     err = register_kho_notifier(&reserve_mem_kho_nb);
> >> > -     if (err) {
> >> > -             put_page(kho_fdt);
> >> > -             kho_fdt = NULL;
> >> > -     }
> >> > -
> >> >       return err;
> >> >  }
> >> >  late_initcall(reserve_mem_init);
> >>
> >> --
> >> Regards,
> >> Pratyush Yadav
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-07 13:30 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bCFsPZQQQ0JFErnYt=dbzBx=ZJdV+eNXYWyNUE+xk7=yA@mail.gmail.com>

On Tue, Oct 07 2025, Pasha Tatashin wrote:

> On Tue, Oct 7, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Mon, Oct 06 2025, Pasha Tatashin wrote:
>>
>> > On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >>
>> >> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>> >>
>> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >> >
>> >> > The KHO framework uses a notifier chain as the mechanism for clients to
>> >> > participate in the finalization process. While this works for a single,
>> >> > central state machine, it is too restrictive for kernel-internal
>> >> > components like pstore/reserve_mem or IMA. These components need a
>> >> > simpler, direct way to register their state for preservation (e.g.,
>> >> > during their initcall) without being part of a complex,
>> >> > shutdown-time notifier sequence. The notifier model forces all
>> >> > participants into a single finalization flow and makes direct
>> >> > preservation from an arbitrary context difficult.
>> >> > This patch refactors the client participation model by removing the
>> >> > notifier chain and introducing a direct API for managing FDT subtrees.
>> >> >
>> >> > The core kho_finalize() and kho_abort() state machine remains, but
>> >> > clients now register their data with KHO beforehand.
>> >> >
>> >> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> >> [...]
>> >> > diff --git a/mm/memblock.c b/mm/memblock.c
>> >> > index e23e16618e9b..c4b2d4e4c715 100644
>> >> > --- a/mm/memblock.c
>> >> > +++ b/mm/memblock.c
>> >> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
>> >> >  #define MEMBLOCK_KHO_FDT "memblock"
>> >> >  #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
>> >> >  #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
>> >> > -static struct page *kho_fdt;
>> >> > -
>> >> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
>> >> > -{
>> >> > -     int err = 0, i;
>> >> > -
>> >> > -     for (i = 0; i < reserved_mem_count; i++) {
>> >> > -             struct reserve_mem_table *map = &reserved_mem_table[i];
>> >> > -             struct page *page = phys_to_page(map->start);
>> >> > -             unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >> > -
>> >> > -             err |= kho_preserve_pages(page, nr_pages);
>> >> > -     }
>> >> > -
>> >> > -     err |= kho_preserve_folio(page_folio(kho_fdt));
>> >> > -     err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
>> >> > -
>> >> > -     return notifier_from_errno(err);
>> >> > -}
>> >> > -
>> >> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
>> >> > -                                 unsigned long cmd, void *v)
>> >> > -{
>> >> > -     switch (cmd) {
>> >> > -     case KEXEC_KHO_FINALIZE:
>> >> > -             return reserve_mem_kho_finalize((struct kho_serialization *)v);
>> >> > -     case KEXEC_KHO_ABORT:
>> >> > -             return NOTIFY_DONE;
>> >> > -     default:
>> >> > -             return NOTIFY_BAD;
>> >> > -     }
>> >> > -}
>> >> > -
>> >> > -static struct notifier_block reserve_mem_kho_nb = {
>> >> > -     .notifier_call = reserve_mem_kho_notifier,
>> >> > -};
>> >> >
>> >> >  static int __init prepare_kho_fdt(void)
>> >> >  {
>> >> >       int err = 0, i;
>> >> > +     struct page *fdt_page;
>> >> >       void *fdt;
>> >> >
>> >> > -     kho_fdt = alloc_page(GFP_KERNEL);
>> >> > -     if (!kho_fdt)
>> >> > +     fdt_page = alloc_page(GFP_KERNEL);
>> >> > +     if (!fdt_page)
>> >> >               return -ENOMEM;
>> >> >
>> >> > -     fdt = page_to_virt(kho_fdt);
>> >> > +     fdt = page_to_virt(fdt_page);
>> >> >
>> >> >       err |= fdt_create(fdt, PAGE_SIZE);
>> >> >       err |= fdt_finish_reservemap(fdt);
>> >> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
>> >> >       err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
>> >> >       for (i = 0; i < reserved_mem_count; i++) {
>> >> >               struct reserve_mem_table *map = &reserved_mem_table[i];
>> >> > +             struct page *page = phys_to_page(map->start);
>> >> > +             unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >> >
>> >> > +             err |= kho_preserve_pages(page, nr_pages);
>> >> >               err |= fdt_begin_node(fdt, map->name);
>> >> >               err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
>> >> >               err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
>> >> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
>> >> >               err |= fdt_end_node(fdt);
>> >> >       }
>> >> >       err |= fdt_end_node(fdt);
>> >> > -
>> >> >       err |= fdt_finish(fdt);
>> >> >
>> >> > +     err |= kho_preserve_folio(page_folio(fdt_page));
>> >> > +     err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
>> >> > +
>> >> >       if (err) {
>> >> >               pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
>> >> > -             put_page(kho_fdt);
>> >> > -             kho_fdt = NULL;
>> >> > +             put_page(fdt_page);
>> >>
>> >> This adds subtree to KHO even if the FDT might be invalid. And then
>> >> leaves a dangling reference in KHO to the FDT in case of an error. I
>> >> think you should either do this check after
>> >> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
>> >> kho_add_subtree(), or call kho_remove_subtree() in the error block.
>> >
>> > I agree, I do not like these err |= stuff, we should be checking
>> > errors cleanly, and do proper clean-ups.
>>
>> Yeah, this is mainly a byproduct of using FDTs. Getting and setting
>> simple properties also needs error checking and that can get tedious
>> real quick. Which is why this pattern has shown up I suppose.
>
> Exactly. This is also why it's important to replace FDT with something
> more sensible for general-purpose live update purposes.
>
> By the way, I forgot to address this comment in the v5 of the KHO
> series I sent out yesterday. Could you please take another look? If
> everything else is good, I will refresh that series so we can ask
> Andrew to take in the KHO patches. That would simplify the LUO series.

Good idea. Will take a look.

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-07 17:10 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> This series introduces the Live Update Orchestrator (LUO), a kernel
> subsystem designed to facilitate live kernel updates. LUO enables
> kexec-based reboots with minimal downtime, a critical capability for
> cloud environments where hypervisors must be updated without disrupting
> running virtual machines. By preserving the state of selected resources,
> such as file descriptors and memory, LUO allows workloads to resume
> seamlessly in the new kernel.
>
> The git branch for this series can be found at:
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
>
> The patch series applies against linux-next tag: next-20250926
>
> While this series is showed cased using memfd preservation. There are
> works to preserve devices:
> 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
>
> =======================================================================
> Changelog since v3:
> (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
>
> - The main architectural change in this version is introduction of
>   "sessions" to manage the lifecycle of preserved file descriptors.
>   In v3, session management was left to a single userspace agent. This
>   approach has been revised to improve robustness. Now, each session is
>   represented by a file descriptor (/dev/liveupdate). The lifecycle of
>   all preserved resources within a session is tied to this FD, ensuring
>   automatic cleanup by the kernel if the controlling userspace agent
>   crashes or exits unexpectedly.
>
> - The first three KHO fixes from the previous series have been merged
>   into Linus' tree.
>
> - Various bug fixes and refactorings, including correcting memory
>   unpreservation logic during a kho_abort() sequence.
>
> - Addressing all comments from reviewers.
>
> - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
>   can now be queried  only via ioctl() API.
>
> =======================================================================

Hi all,

Following up on yesterday's Hypervisor Live Update meeting, we
discussed the requirements for the LUO to track dependencies,
particularly for IOMMU preservation and other stateful file
descriptors. This email summarizes the main design decisions and
outcomes from that discussion.

For context, the notes from the previous meeting can be found here:
https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
The notes for yesterday's meeting are not yes available.

The key outcomes are as follows:

1. User-Enforced Ordering
-------------------------
The responsibility for enforcing the correct order of operations will
lie with the userspace agent. If fd_A is a dependency for fd_B,
userspace must ensure that fd_A is preserved before fd_B. This same
ordering must be honored during the restoration phase after the reboot
(fd_A must be restored before fd_B). The kernel preserve the ordering.

2. Serialization in PRESERVE_FD
-------------------------------
To keep the global prepare() phase lightweight and predictable, the
consensus was to shift the heavy serialization work into the
PRESERVE_FD ioctl handler. This means that when userspace requests to
preserve a file, the file handler should perform the bulk of the
state-saving work immediately.

The proposed sequence of operations reflects this shift:

Shutdown Flow:
fd_preserve() (heavy serialization) -> prepare() (lightweight final
checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)

Boot & Restore Flow:
fd_restore() (lightweight object creation) -> Resume VM -> Heavy
post-restore IOCTLs (e.g., hardware page table re-creation) ->
finish() (lightweight cleanup)

This decision primarily serves as a guideline for file handler
implementations. For the LUO core, this implies minor API changes,
such as renaming can_preserve() to a more active preserve() and adding
a corresponding unpreserve() callback to be called during
UNPRESERVE_FD.

3. FD Data Query API
--------------------
We identified the need for a kernel API to allow subsystems to query
preserved FD data during the boot process, before userspace has
initiated the restore.

The proposed API would allow a file handler to retrieve a list of all
its preserved FDs, including their session names, tokens, and the
private data payload.

Proposed Data Structure:

struct liveupdate_fd {
        char *session; /* session name */
        u64 token; /* Preserved FD token */
        u64 data; /* Private preserved data */
};

Proposed Function:
liveupdate_fd_data_query(struct liveupdate_file_handler *h,
                         struct liveupdate_fd *fds, long *count);

4. New File-Lifecycle-Bound Global State
----------------------------------------
A new mechanism for managing global state was proposed, designed to be
tied to the lifecycle of the preserved files themselves. This would
allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
global state that is only relevant when one or more of its FDs are
being managed by LUO.

The key characteristics of this new mechanism are:
The global state is optionally created on the first preserve() call
for a given file handler.
The state can be updated on subsequent preserve() calls.
The state is destroyed when the last corresponding file is unpreserved
or finished.
The data can be accessed during boot.

I am thinking of an API like this.

1. Add three more callbacks to liveupdate_file_ops:
/*
 * Optional. Called by LUO during first get global state call.
 * The handler should allocate/KHO preserve its global state object and return a
 * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
 * address of preserved memory) via 'data_handle' that LUO will save.
 * Return: 0 on success.
 */
int (*global_state_create)(struct liveupdate_file_handler *h,
                           void **obj, u64 *data_handle);

/*
 * Optional. Called by LUO in the new kernel
 * before the first access to the global state. The handler receives
 * the preserved u64 data_handle and should use it to reconstruct its
 * global state object, returning a pointer to it via 'obj'.
 * Return: 0 on success.
 */
int (*global_state_restore)(struct liveupdate_file_handler *h,
                            u64 data_handle, void **obj);

/*
 * Optional. Called by LUO after the last
 * file for this handler is unpreserved or finished. The handler
 * must free its global state object and any associated resources.
 */
void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);

The get/put global state data:

/* Get and lock the data with file_handler scoped lock */
int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
                                   void **obj);

/* Unlock the data */
void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

Execution Flow:
1. Outgoing Kernel (First preserve() call):
2. Handler's preserve() is called. It needs the global state, so it calls
   liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
   It sees h->global_state_obj is NULL.
   LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
   The handler allocates its state, preserves it with KHO, and returns its live
   pointer and a u64 handle.
3. LUO stores the handle internally for later serialization.
4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
5. The preserve() callback does its work using the obj.
6. It calls liveupdate_fh_global_state_put(h), which releases the lock.

Global PREPARE:
1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
   the LUO FDT.

Incoming Kernel (First access):
1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time. LUO
   acquires h->global_state_lock.
2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
   handle from the FDT. LUO calls h->ops->global_state_restore()
3. Reconstructs its state object, and returns the live pointer.
4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
5. The caller does its work.
6. It calls liveupdate_fh_global_state_put(h) to release the lock.

Last File Cleanup (in unpreserve or finish):
1. LUO decrements h->count to 0.
2. This triggers the cleanup logic.
3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
4. The handler frees its memory and resources.
5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
   cycle.

Pasha


Pasha

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-07 17:50 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl, steven.sistare
In-Reply-To: <CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com>

On Tue, Oct 07, 2025 at 01:10:30PM -0400, Pasha Tatashin wrote:
> 
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
> 
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);

It shouldn't be a "push" like this. Everything has a certain logical point
when it will need the luo data, it should be coded to 'pull' the data
right at that point.


> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);

I'm not sure a callback here is a good idea, the users are synchronous
at early boot, they should get their data and immediately process it
within the context of the caller. A 'unpack' callback does not seem so
useful to me.

> The get/put global state data:
> 
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
> 
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

Maybe lock/unlock if it is locking.

It seems like a good direction overall. Really need to see how it
works with some examples

Jason

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-08  3:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl, steven.sistare
In-Reply-To: <20251007175038.GB3474167@nvidia.com>

On Tue, Oct 7, 2025 at 1:50 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Oct 07, 2025 at 01:10:30PM -0400, Pasha Tatashin wrote:
> >
> > 1. Add three more callbacks to liveupdate_file_ops:
> > /*
> >  * Optional. Called by LUO during first get global state call.
> >  * The handler should allocate/KHO preserve its global state object and return a
> >  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
> >  * address of preserved memory) via 'data_handle' that LUO will save.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_create)(struct liveupdate_file_handler *h,
> >                            void **obj, u64 *data_handle);
> >
> > /*
> >  * Optional. Called by LUO in the new kernel
> >  * before the first access to the global state. The handler receives
> >  * the preserved u64 data_handle and should use it to reconstruct its
> >  * global state object, returning a pointer to it via 'obj'.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_restore)(struct liveupdate_file_handler *h,
> >                             u64 data_handle, void **obj);
>
> It shouldn't be a "push" like this. Everything has a certain logical point
> when it will need the luo data, it should be coded to 'pull' the data
> right at that point.

It is not pushed, this callback is done automatically on the first
call liveupdate_fh_global_state_lock() in the new kernel, so exactly,
when a user tries to access the global data, it is restored from KHO
for the user to be able to access it via a normal pointer.

> > /*
> >  * Optional. Called by LUO after the last
> >  * file for this handler is unpreserved or finished. The handler
> >  * must free its global state object and any associated resources.
> >  */
> > void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> I'm not sure a callback here is a good idea, the users are synchronous
> at early boot, they should get their data and immediately process it
> within the context of the caller. A 'unpack' callback does not seem so
> useful to me.

This callback is also automatic, it is called only once the last FD is
finished, and LUO has no FDs for this file handler, so the global
state can be properly freed. There is no unpack here.

>
> > The get/put global state data:
> >
> > /* Get and lock the data with file_handler scoped lock */
> > int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
> >                                    void **obj);
> >
> > /* Unlock the data */
> > void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Maybe lock/unlock if it is locking.

Sure, will name them:
liveupdate_fh_global_state_lock()
liveupdate_fh_global_state_unlock()

>
> It seems like a good direction overall. Really need to see how it
> works with some examples
>
> Jason

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Samiullah Khawaja @ 2025-10-08  7:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, chrisl, steven.sistare
In-Reply-To: <CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com>

On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > This series introduces the Live Update Orchestrator (LUO), a kernel
> > subsystem designed to facilitate live kernel updates. LUO enables
> > kexec-based reboots with minimal downtime, a critical capability for
> > cloud environments where hypervisors must be updated without disrupting
> > running virtual machines. By preserving the state of selected resources,
> > such as file descriptors and memory, LUO allows workloads to resume
> > seamlessly in the new kernel.
> >
> > The git branch for this series can be found at:
> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
> >
> > The patch series applies against linux-next tag: next-20250926
> >
> > While this series is showed cased using memfd preservation. There are
> > works to preserve devices:
> > 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> > 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
> >
> > =======================================================================
> > Changelog since v3:
> > (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
> >
> > - The main architectural change in this version is introduction of
> >   "sessions" to manage the lifecycle of preserved file descriptors.
> >   In v3, session management was left to a single userspace agent. This
> >   approach has been revised to improve robustness. Now, each session is
> >   represented by a file descriptor (/dev/liveupdate). The lifecycle of
> >   all preserved resources within a session is tied to this FD, ensuring
> >   automatic cleanup by the kernel if the controlling userspace agent
> >   crashes or exits unexpectedly.
> >
> > - The first three KHO fixes from the previous series have been merged
> >   into Linus' tree.
> >
> > - Various bug fixes and refactorings, including correcting memory
> >   unpreservation logic during a kho_abort() sequence.
> >
> > - Addressing all comments from reviewers.
> >
> > - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
> >   can now be queried  only via ioctl() API.
> >
> > =======================================================================
>
> Hi all,
>
> Following up on yesterday's Hypervisor Live Update meeting, we
> discussed the requirements for the LUO to track dependencies,
> particularly for IOMMU preservation and other stateful file
> descriptors. This email summarizes the main design decisions and
> outcomes from that discussion.
>
> For context, the notes from the previous meeting can be found here:
> https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
> The notes for yesterday's meeting are not yes available.
>
> The key outcomes are as follows:
>
> 1. User-Enforced Ordering
> -------------------------
> The responsibility for enforcing the correct order of operations will
> lie with the userspace agent. If fd_A is a dependency for fd_B,
> userspace must ensure that fd_A is preserved before fd_B. This same
> ordering must be honored during the restoration phase after the reboot
> (fd_A must be restored before fd_B). The kernel preserve the ordering.
>
> 2. Serialization in PRESERVE_FD
> -------------------------------
> To keep the global prepare() phase lightweight and predictable, the
> consensus was to shift the heavy serialization work into the
> PRESERVE_FD ioctl handler. This means that when userspace requests to
> preserve a file, the file handler should perform the bulk of the
> state-saving work immediately.
>
> The proposed sequence of operations reflects this shift:
>
> Shutdown Flow:
> fd_preserve() (heavy serialization) -> prepare() (lightweight final
> checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
>
> Boot & Restore Flow:
> fd_restore() (lightweight object creation) -> Resume VM -> Heavy
> post-restore IOCTLs (e.g., hardware page table re-creation) ->
> finish() (lightweight cleanup)
>
> This decision primarily serves as a guideline for file handler
> implementations. For the LUO core, this implies minor API changes,
> such as renaming can_preserve() to a more active preserve() and adding
> a corresponding unpreserve() callback to be called during
> UNPRESERVE_FD.
>
> 3. FD Data Query API
> --------------------
> We identified the need for a kernel API to allow subsystems to query
> preserved FD data during the boot process, before userspace has
> initiated the restore.
>
> The proposed API would allow a file handler to retrieve a list of all
> its preserved FDs, including their session names, tokens, and the
> private data payload.
>
> Proposed Data Structure:
>
> struct liveupdate_fd {
>         char *session; /* session name */
>         u64 token; /* Preserved FD token */
>         u64 data; /* Private preserved data */
> };
>
> Proposed Function:
> liveupdate_fd_data_query(struct liveupdate_file_handler *h,
>                          struct liveupdate_fd *fds, long *count);
>
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.
>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:

This part is a little tricky, the file handler might be in a
completely different subsystem as compared to the global state. While
FD is supposed to own and control the lifecycle of the preserved
state, the global state might be needed in a completely different
layer during boot or some other event. Maybe the user can put some
APIs in place to move this state across layers?

Subsystems actually do provide this flexibility. But If I see
correctly, this approach is "global" like a subsystem but synchronous.
Users can decide when to preserve/create global state when they need
without the need to stage it and preserve it when subsystem PREPARE is
called.
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);

If I understand correctly, is this only for unpacking? Once unpacked
the user can call the _get function and get the global state that it
just unpacked. This should be fine.
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time. LUO
>    acquires h->global_state_lock.
> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.
>
> Pasha
>
>
> Pasha

^ permalink raw reply

* [PATCH 0/2] Fix to EOPNOTSUPP double conversion in ioctl_setflags()
From: Andrey Albershteyn @ 2025-10-08 12:44 UTC (permalink / raw)
  To: linux-api, linux-fsdevel, linux-kernel, linux-xfs
  Cc: Jan Kara, Jiri Slaby, Christian Brauner, Arnd Bergmann,
	Andrey Albershteyn

Revert original double conversion patch from ENOIOCTLCMD to EOPNOSUPP for
vfs_fileattr_get and vfs_fileattr_set. Instead, convert ENOIOCTLCMD only
where necessary.

To: linux-api@vger.kernel.org
To: linux-fsdevel@vger.kernel.org
To: linux-kernel@vger.kernel.org
To: linux-xfs@vger.kernel.org,
Cc: "Jan Kara" <jack@suse.cz>
Cc: "Jiri Slaby" <jirislaby@kernel.org>
Cc: "Christian Brauner" <brauner@kernel.org>
Cc: "Arnd Bergmann" <arnd@arndb.de>

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
Andrey Albershteyn (2):
      Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"
      fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls

 fs/file_attr.c         | 16 ++++++----------
 fs/fuse/ioctl.c        |  4 ----
 fs/overlayfs/copy_up.c |  2 +-
 fs/overlayfs/inode.c   |  5 ++++-
 4 files changed, 11 insertions(+), 16 deletions(-)
---
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a
change-id: 20251007-eopnosupp-fix-d2f30fd7d873

Best regards,
--  
Andrey Albershteyn <aalbersh@kernel.org>


^ permalink raw reply

* [PATCH 1/2] Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"
From: Andrey Albershteyn @ 2025-10-08 12:44 UTC (permalink / raw)
  To: linux-api, linux-fsdevel, linux-kernel, linux-xfs
  Cc: Jan Kara, Jiri Slaby, Christian Brauner, Arnd Bergmann,
	Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-0-5990de009c9f@kernel.org>

This reverts commit 474b155adf3927d2c944423045757b54aa1ca4de.

This patch caused regression in ioctl_setflags(). Underlying filesystems
use EOPNOTSUPP to indicate that flag is not supported. This error is
also gets converted in ioctl_setflags(). Therefore, for unsupported
flags error changed from EOPNOSUPP to ENOIOCTLCMD.

Link: https://lore.kernel.org/linux-xfs/a622643f-1585-40b0-9441-cf7ece176e83@kernel.org/
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/file_attr.c         | 12 ++----------
 fs/fuse/ioctl.c        |  4 ----
 fs/overlayfs/copy_up.c |  2 +-
 fs/overlayfs/inode.c   |  5 ++++-
 4 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/fs/file_attr.c b/fs/file_attr.c
index 12424d4945d0..460b2dd21a85 100644
--- a/fs/file_attr.c
+++ b/fs/file_attr.c
@@ -84,7 +84,7 @@ int vfs_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
 	int error;
 
 	if (!inode->i_op->fileattr_get)
-		return -EOPNOTSUPP;
+		return -ENOIOCTLCMD;
 
 	error = security_inode_file_getattr(dentry, fa);
 	if (error)
@@ -270,7 +270,7 @@ int vfs_fileattr_set(struct mnt_idmap *idmap, struct dentry *dentry,
 	int err;
 
 	if (!inode->i_op->fileattr_set)
-		return -EOPNOTSUPP;
+		return -ENOIOCTLCMD;
 
 	if (!inode_owner_or_capable(idmap, inode))
 		return -EPERM;
@@ -312,8 +312,6 @@ int ioctl_getflags(struct file *file, unsigned int __user *argp)
 	int err;
 
 	err = vfs_fileattr_get(file->f_path.dentry, &fa);
-	if (err == -EOPNOTSUPP)
-		err = -ENOIOCTLCMD;
 	if (!err)
 		err = put_user(fa.flags, argp);
 	return err;
@@ -335,8 +333,6 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
 			fileattr_fill_flags(&fa, flags);
 			err = vfs_fileattr_set(idmap, dentry, &fa);
 			mnt_drop_write_file(file);
-			if (err == -EOPNOTSUPP)
-				err = -ENOIOCTLCMD;
 		}
 	}
 	return err;
@@ -349,8 +345,6 @@ int ioctl_fsgetxattr(struct file *file, void __user *argp)
 	int err;
 
 	err = vfs_fileattr_get(file->f_path.dentry, &fa);
-	if (err == -EOPNOTSUPP)
-		err = -ENOIOCTLCMD;
 	if (!err)
 		err = copy_fsxattr_to_user(&fa, argp);
 
@@ -371,8 +365,6 @@ int ioctl_fssetxattr(struct file *file, void __user *argp)
 		if (!err) {
 			err = vfs_fileattr_set(idmap, dentry, &fa);
 			mnt_drop_write_file(file);
-			if (err == -EOPNOTSUPP)
-				err = -ENOIOCTLCMD;
 		}
 	}
 	return err;
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 57032eadca6c..fdc175e93f74 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -536,8 +536,6 @@ int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);
 
-	if (err == -ENOTTY)
-		err = -EOPNOTSUPP;
 	return err;
 }
 
@@ -574,7 +572,5 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
 cleanup:
 	fuse_priv_ioctl_cleanup(inode, ff);
 
-	if (err == -ENOTTY)
-		err = -EOPNOTSUPP;
 	return err;
 }
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 27396fe63f6d..20c92ea58093 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -178,7 +178,7 @@ static int ovl_copy_fileattr(struct inode *inode, const struct path *old,
 	err = ovl_real_fileattr_get(old, &oldfa);
 	if (err) {
 		/* Ntfs-3g returns -EINVAL for "no fileattr support" */
-		if (err == -EOPNOTSUPP || err == -EINVAL)
+		if (err == -ENOTTY || err == -EINVAL)
 			return 0;
 		pr_warn("failed to retrieve lower fileattr (%pd2, err=%i)\n",
 			old->dentry, err);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index ecb9f2019395..d4722e1b83bc 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -720,7 +720,10 @@ int ovl_real_fileattr_get(const struct path *realpath, struct file_kattr *fa)
 	if (err)
 		return err;
 
-	return vfs_fileattr_get(realpath->dentry, fa);
+	err = vfs_fileattr_get(realpath->dentry, fa);
+	if (err == -ENOIOCTLCMD)
+		err = -ENOTTY;
+	return err;
 }
 
 int ovl_fileattr_get(struct dentry *dentry, struct file_kattr *fa)

-- 
2.51.0


^ permalink raw reply related

* [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Andrey Albershteyn @ 2025-10-08 12:44 UTC (permalink / raw)
  To: linux-api, linux-fsdevel, linux-kernel, linux-xfs
  Cc: Jan Kara, Jiri Slaby, Christian Brauner, Arnd Bergmann,
	Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-0-5990de009c9f@kernel.org>

These syscalls call to vfs_fileattr_get/set functions which return
ENOIOCTLCMD if filesystem doesn't support setting file attribute on an
inode. For syscalls EOPNOTSUPP would be more appropriate return error.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/file_attr.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/file_attr.c b/fs/file_attr.c
index 460b2dd21a85..5e3e2aba97b5 100644
--- a/fs/file_attr.c
+++ b/fs/file_attr.c
@@ -416,6 +416,8 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
 	}
 
 	error = vfs_fileattr_get(filepath.dentry, &fa);
+	if (error == -ENOIOCTLCMD)
+		error = -EOPNOTSUPP;
 	if (error)
 		return error;
 
@@ -483,6 +485,8 @@ SYSCALL_DEFINE5(file_setattr, int, dfd, const char __user *, filename,
 	if (!error) {
 		error = vfs_fileattr_set(mnt_idmap(filepath.mnt),
 					 filepath.dentry, &fa);
+		if (error == -ENOIOCTLCMD)
+			error = -EOPNOTSUPP;
 		mnt_drop_write(filepath.mnt);
 	}
 

-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Arnd Bergmann @ 2025-10-08 13:23 UTC (permalink / raw)
  To: Andrey Albershteyn, linux-api, linux-fsdevel, linux-kernel,
	linux-xfs
  Cc: Jan Kara, Jiri Slaby, Christian Brauner, Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-2-5990de009c9f@kernel.org>

On Wed, Oct 8, 2025, at 14:44, Andrey Albershteyn wrote:
> These syscalls call to vfs_fileattr_get/set functions which return
> ENOIOCTLCMD if filesystem doesn't support setting file attribute on an
> inode. For syscalls EOPNOTSUPP would be more appropriate return error.
>
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

Reviewed-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply

* Re: [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Jan Kara @ 2025-10-08 13:30 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-xfs, Jan Kara,
	Jiri Slaby, Christian Brauner, Arnd Bergmann, Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-2-5990de009c9f@kernel.org>

On Wed 08-10-25 14:44:18, Andrey Albershteyn wrote:
> These syscalls call to vfs_fileattr_get/set functions which return
> ENOIOCTLCMD if filesystem doesn't support setting file attribute on an
> inode. For syscalls EOPNOTSUPP would be more appropriate return error.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/file_attr.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/file_attr.c b/fs/file_attr.c
> index 460b2dd21a85..5e3e2aba97b5 100644
> --- a/fs/file_attr.c
> +++ b/fs/file_attr.c
> @@ -416,6 +416,8 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
>  	}
>  
>  	error = vfs_fileattr_get(filepath.dentry, &fa);
> +	if (error == -ENOIOCTLCMD)
> +		error = -EOPNOTSUPP;
>  	if (error)
>  		return error;
>  
> @@ -483,6 +485,8 @@ SYSCALL_DEFINE5(file_setattr, int, dfd, const char __user *, filename,
>  	if (!error) {
>  		error = vfs_fileattr_set(mnt_idmap(filepath.mnt),
>  					 filepath.dentry, &fa);
> +		if (error == -ENOIOCTLCMD)
> +			error = -EOPNOTSUPP;
>  		mnt_drop_write(filepath.mnt);
>  	}
>  
> 
> -- 
> 2.51.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Andy Lutomirski @ 2025-10-08 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs
In-Reply-To: <aOSgXXzvuq5YDj7q@infradead.org>

On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > Well, we'll need to look into that, including maybe non-blockin
> > > timestamp updates.
> > >
> >
> > It's been 12 years (!), but maybe it's time to reconsider this:
> >
> > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
>
> I don't see how that is relevant here.  Also writes through shared
> mmaps are problematic for so many reasons that I'm not sure we want
> to encourage people to use that more.
>

Because the same exact issue exists in the normal non-mmap write path,
and I can even quote you upthread :)

> Well, we'll need to look into that, including maybe non-blockin
timestamp updates.

I assume the code path that inspired this thread in the first place is:

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        ssize_t ret;

        ret = file_remove_privs(file);
        if (ret)
                return ret;

        ret = file_update_time(file);

and this has *exactly* the same problem as the shared-mmap write path:
it synchronously updates the time (well, synchronously enough that it
sometimes blocks), and it does so before updating the file contents
(although the window during which the timestamp is updated and the
contents are not is not as absurdly long as it is in the mmap case).

Now my series does not change any of this, but I'm thinking more of
the concept: instead of doing file/inode_update_time when a file is
logically written (in write_iter, page_mkwrite, etc), set a flag so
that the writeback code knows that the timestamp needs updating.
Thinking out loud, to handle both write_iter and mmap, there might
need to be two bits: one saying "the timestamp needs to be updated"
and another saying "the timestamp has been updated in the in-memory
inode, but the inode hasn't been dirtied yet".  And maybe the latter
is doable entirely within fs-specific code without any help from the
generic code, but it might still be nice to keep generic_update_time
usable for filesystems that want to do this.

--Andy

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-08 16:40 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, chrisl, steven.sistare
In-Reply-To: <CAAywjhSP=ugnSJOHPGmTUPGh82wt+qnaqZAqo99EfhF-XHD5Sg@mail.gmail.com>

On Wed, Oct 8, 2025 at 3:04 AM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > This series introduces the Live Update Orchestrator (LUO), a kernel
> > > subsystem designed to facilitate live kernel updates. LUO enables
> > > kexec-based reboots with minimal downtime, a critical capability for
> > > cloud environments where hypervisors must be updated without disrupting
> > > running virtual machines. By preserving the state of selected resources,
> > > such as file descriptors and memory, LUO allows workloads to resume
> > > seamlessly in the new kernel.
> > >
> > > The git branch for this series can be found at:
> > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
> > >
> > > The patch series applies against linux-next tag: next-20250926
> > >
> > > While this series is showed cased using memfd preservation. There are
> > > works to preserve devices:
> > > 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> > > 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
> > >
> > > =======================================================================
> > > Changelog since v3:
> > > (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
> > >
> > > - The main architectural change in this version is introduction of
> > >   "sessions" to manage the lifecycle of preserved file descriptors.
> > >   In v3, session management was left to a single userspace agent. This
> > >   approach has been revised to improve robustness. Now, each session is
> > >   represented by a file descriptor (/dev/liveupdate). The lifecycle of
> > >   all preserved resources within a session is tied to this FD, ensuring
> > >   automatic cleanup by the kernel if the controlling userspace agent
> > >   crashes or exits unexpectedly.
> > >
> > > - The first three KHO fixes from the previous series have been merged
> > >   into Linus' tree.
> > >
> > > - Various bug fixes and refactorings, including correcting memory
> > >   unpreservation logic during a kho_abort() sequence.
> > >
> > > - Addressing all comments from reviewers.
> > >
> > > - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
> > >   can now be queried  only via ioctl() API.
> > >
> > > =======================================================================
> >
> > Hi all,
> >
> > Following up on yesterday's Hypervisor Live Update meeting, we
> > discussed the requirements for the LUO to track dependencies,
> > particularly for IOMMU preservation and other stateful file
> > descriptors. This email summarizes the main design decisions and
> > outcomes from that discussion.
> >
> > For context, the notes from the previous meeting can be found here:
> > https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
> > The notes for yesterday's meeting are not yes available.
> >
> > The key outcomes are as follows:
> >
> > 1. User-Enforced Ordering
> > -------------------------
> > The responsibility for enforcing the correct order of operations will
> > lie with the userspace agent. If fd_A is a dependency for fd_B,
> > userspace must ensure that fd_A is preserved before fd_B. This same
> > ordering must be honored during the restoration phase after the reboot
> > (fd_A must be restored before fd_B). The kernel preserve the ordering.
> >
> > 2. Serialization in PRESERVE_FD
> > -------------------------------
> > To keep the global prepare() phase lightweight and predictable, the
> > consensus was to shift the heavy serialization work into the
> > PRESERVE_FD ioctl handler. This means that when userspace requests to
> > preserve a file, the file handler should perform the bulk of the
> > state-saving work immediately.
> >
> > The proposed sequence of operations reflects this shift:
> >
> > Shutdown Flow:
> > fd_preserve() (heavy serialization) -> prepare() (lightweight final
> > checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
> >
> > Boot & Restore Flow:
> > fd_restore() (lightweight object creation) -> Resume VM -> Heavy
> > post-restore IOCTLs (e.g., hardware page table re-creation) ->
> > finish() (lightweight cleanup)
> >
> > This decision primarily serves as a guideline for file handler
> > implementations. For the LUO core, this implies minor API changes,
> > such as renaming can_preserve() to a more active preserve() and adding
> > a corresponding unpreserve() callback to be called during
> > UNPRESERVE_FD.
> >
> > 3. FD Data Query API
> > --------------------
> > We identified the need for a kernel API to allow subsystems to query
> > preserved FD data during the boot process, before userspace has
> > initiated the restore.
> >
> > The proposed API would allow a file handler to retrieve a list of all
> > its preserved FDs, including their session names, tokens, and the
> > private data payload.
> >
> > Proposed Data Structure:
> >
> > struct liveupdate_fd {
> >         char *session; /* session name */
> >         u64 token; /* Preserved FD token */
> >         u64 data; /* Private preserved data */
> > };
> >
> > Proposed Function:
> > liveupdate_fd_data_query(struct liveupdate_file_handler *h,
> >                          struct liveupdate_fd *fds, long *count);
> >
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
> >
> > The key characteristics of this new mechanism are:
> > The global state is optionally created on the first preserve() call
> > for a given file handler.
> > The state can be updated on subsequent preserve() calls.
> > The state is destroyed when the last corresponding file is unpreserved
> > or finished.
> > The data can be accessed during boot.
> >
> > I am thinking of an API like this.

Sami and I discussed this further, and we agree that the proposed API
will work. We also identified two additional requirements that were
not mentioned in my previous email:

1. Ordered Un-preservation
The un-preservation of file descriptors must also be ordered and must
occur in the reverse order of preservation. For example, if a user
preserves a memfd first and then an iommufd that depends on it, the
iommufd must be un-preserved before the memfd when the session is
closed or the FDs are explicitly un-preserved.

2. New API to Check Preservation Status
A new LUO API will be needed to check if a struct file is already
preserved within a session. This is needed for dependency validation.
The proposed function would look like this:

bool liveupdate_is_file_preserved(struct liveupdate_session *session,
struct file *file);

This will allow the file handler for one FD (e.g., iommufd) to verify
during its preserve() callback that its dependencies (e.g., the
backing memfd) have already been preserved in the same session.

Pasha

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-08 19:35 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bAG+YAS7oSpdrZYDK0LU2mhfRuj2qTJtT-Hn8FLUbt=Dw@mail.gmail.com>

On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> 1. Ordered Un-preservation
> The un-preservation of file descriptors must also be ordered and must
> occur in the reverse order of preservation. For example, if a user
> preserves a memfd first and then an iommufd that depends on it, the
> iommufd must be un-preserved before the memfd when the session is
> closed or the FDs are explicitly un-preserved.

Why?

I imagined the first to unpreserve would restore the struct file * -
that would satisfy the order.

The ioctl version that is to get back a FD would recover the struct
file and fd_install it.

Meaning preserve side is retaining a database of labels to restored
struct file *'s.

As discussed unpreserve a FD does not imply unfreeze, which is the
opposite of how preserver works.

> 2. New API to Check Preservation Status
> A new LUO API will be needed to check if a struct file is already
> preserved within a session. This is needed for dependency validation.
> The proposed function would look like this:

This doesn't seem right, the API should be more like 'luo get
serialization handle for this file *'

If it hasn't been preserved then there won't be a handle, otherwise it
should return something to allow the unpreserving side to recover this
struct file *.

That's the general use case at least, there may be some narrower use
cases where the preserver throws away the handle.

Jason

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-08 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <20251008193551.GA3839422@nvidia.com>

On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > 1. Ordered Un-preservation
> > The un-preservation of file descriptors must also be ordered and must
> > occur in the reverse order of preservation. For example, if a user
> > preserves a memfd first and then an iommufd that depends on it, the
> > iommufd must be un-preserved before the memfd when the session is
> > closed or the FDs are explicitly un-preserved.
>
> Why?
>
> I imagined the first to unpreserve would restore the struct file * -
> that would satisfy the order.

In my description, "un-preserve" refers to the action of canceling a
preservation request in the outgoing kernel, before kexec ever
happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
used when a user decides not to go through with the live update for a
specific FD.

The terminology I am using:
preserve: Put FD into LUO in the outgoing kernel
unpreserve: Remove FD from LUO from the outgoing kernel
retrieve: Restore FD and return it to user in the next kernel

For the retrieval part, we are going to be using FIFO order, the same
as preserve.

> The ioctl version that is to get back a FD would recover the struct
> file and fd_install it.
>
> Meaning preserve side is retaining a database of labels to restored
> struct file *'s.
>
> As discussed unpreserve a FD does not imply unfreeze, which is the
> opposite of how preserver works.
>
> > 2. New API to Check Preservation Status
> > A new LUO API will be needed to check if a struct file is already
> > preserved within a session. This is needed for dependency validation.
> > The proposed function would look like this:
>
> This doesn't seem right, the API should be more like 'luo get
> serialization handle for this file *'

How about:

int liveupdate_find_token(struct liveupdate_session *session,
                          struct file *file, u64 *token);

And if needed:
int liveupdate_find_file(struct liveupdate_session *session,
                         u64 token, struct file **file);

Return: 0 on success, or -ENOENT if the file is not preserved.

Pasha

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox