* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-07 17:10 UTC
To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
roman.gushchin, chenridong, axboe, mark.rutland, jannh,
vincent.guittot, hannes, dan.j.williams, david, joel.granados,
rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>
On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> This series introduces the Live Update Orchestrator (LUO), a kernel
> subsystem designed to facilitate live kernel updates. LUO enables
> kexec-based reboots with minimal downtime, a critical capability for
> cloud environments where hypervisors must be updated without disrupting
> running virtual machines. By preserving the state of selected resources,
> such as file descriptors and memory, LUO allows workloads to resume
> seamlessly in the new kernel.
>
> The git branch for this series can be found at:
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
>
> The patch series applies against linux-next tag: next-20250926
>
> While this series is showcased using memfd preservation, there is
> ongoing work to preserve devices:
> 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
>
> =======================================================================
> Changelog since v3:
> (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
>
> - The main architectural change in this version is introduction of
> "sessions" to manage the lifecycle of preserved file descriptors.
> In v3, session management was left to a single userspace agent. This
> approach has been revised to improve robustness. Now, each session is
> represented by a file descriptor (/dev/liveupdate). The lifecycle of
> all preserved resources within a session is tied to this FD, ensuring
> automatic cleanup by the kernel if the controlling userspace agent
> crashes or exits unexpectedly.
>
> - The first three KHO fixes from the previous series have been merged
> into Linus' tree.
>
> - Various bug fixes and refactorings, including correcting memory
> unpreservation logic during a kho_abort() sequence.
>
> - Addressing all comments from reviewers.
>
> - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
> can now be queried only via ioctl() API.
>
> =======================================================================
Hi all,
Following up on yesterday's Hypervisor Live Update meeting, we
discussed the requirements for the LUO to track dependencies,
particularly for IOMMU preservation and other stateful file
descriptors. This email summarizes the main design decisions and
outcomes from that discussion.
For context, the notes from the previous meeting can be found here:
https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
The notes for yesterday's meeting are not yet available.
The key outcomes are as follows:
1. User-Enforced Ordering
-------------------------
The responsibility for enforcing the correct order of operations will
lie with the userspace agent. If fd_A is a dependency for fd_B,
userspace must ensure that fd_A is preserved before fd_B. This same
ordering must be honored during the restoration phase after the reboot
(fd_A must be restored before fd_B). The kernel preserves the ordering.
2. Serialization in PRESERVE_FD
-------------------------------
To keep the global prepare() phase lightweight and predictable, the
consensus was to shift the heavy serialization work into the
PRESERVE_FD ioctl handler. This means that when userspace requests to
preserve a file, the file handler should perform the bulk of the
state-saving work immediately.
The proposed sequence of operations reflects this shift:
Shutdown Flow:
fd_preserve() (heavy serialization) -> prepare() (lightweight final
checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
Boot & Restore Flow:
fd_restore() (lightweight object creation) -> Resume VM -> Heavy
post-restore IOCTLs (e.g., hardware page table re-creation) ->
finish() (lightweight cleanup)
This decision primarily serves as a guideline for file handler
implementations. For the LUO core, this implies minor API changes,
such as renaming can_preserve() to a more active preserve() and adding
a corresponding unpreserve() callback to be called during
UNPRESERVE_FD.
3. FD Data Query API
--------------------
We identified the need for a kernel API to allow subsystems to query
preserved FD data during the boot process, before userspace has
initiated the restore.
The proposed API would allow a file handler to retrieve a list of all
its preserved FDs, including their session names, tokens, and the
private data payload.
Proposed Data Structure:
struct liveupdate_fd {
        char *session; /* session name */
        u64 token;     /* Preserved FD token */
        u64 data;      /* Private preserved data */
};
Proposed Function:
liveupdate_fd_data_query(struct liveupdate_file_handler *h,
struct liveupdate_fd *fds, long *count);
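To make the intended use of such a query easier to discuss, here is a minimal userspace sketch of one possible calling convention. The proposal does not state a return type or what 'count' means on entry, so everything here is an assumption: fd_data_query(), struct liveupdate_fd_rec, the sample table, the "size first, then fetch" contract and the -ENOSPC result are all hypothetical, and the handler argument is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed struct liveupdate_fd. */
struct liveupdate_fd_rec {
	const char *session;	/* session name */
	uint64_t token;		/* preserved FD token */
	uint64_t data;		/* private preserved data */
};

/* Hypothetical set of FDs that the LUO core holds for one file handler. */
static const struct liveupdate_fd_rec preserved[] = {
	{ "vm-1", 1, 0x1000 },
	{ "vm-1", 2, 0x2000 },
	{ "vm-2", 1, 0x3000 },
};
#define NR_PRESERVED ((long)(sizeof(preserved) / sizeof(preserved[0])))

/*
 * Assumed contract (not part of the proposal): on entry *count is the
 * capacity of 'fds'; on return it is the number of preserved FDs.
 * Passing fds == NULL only reports the count. Returns -ENOSPC when the
 * caller's array is too small, 0 otherwise.
 */
static int fd_data_query(struct liveupdate_fd_rec *fds, long *count)
{
	long cap = *count, i;

	*count = NR_PRESERVED;
	if (!fds)
		return 0;
	if (cap < NR_PRESERVED)
		return -ENOSPC;
	for (i = 0; i < NR_PRESERVED; i++)
		fds[i] = preserved[i];
	return 0;
}
```

With this convention a subsystem would call the function once with a NULL array to learn the size, allocate, and call it again to fetch the entries. Whether the real API should work this way is an open question for the design.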
4. New File-Lifecycle-Bound Global State
----------------------------------------
A new mechanism for managing global state was proposed, designed to be
tied to the lifecycle of the preserved files themselves. This would
allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
global state that is only relevant when one or more of its FDs are
being managed by LUO.
The key characteristics of this new mechanism are:
- The global state is optionally created on the first preserve() call
  for a given file handler.
- The state can be updated on subsequent preserve() calls.
- The state is destroyed when the last corresponding file is unpreserved
  or finished.
- The data can be accessed during boot.
I am thinking of an API like this.
1. Add three more callbacks to liveupdate_file_ops:
/*
 * Optional. Called by LUO during first get global state call.
 * The handler should allocate/KHO preserve its global state object and return a
 * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
 * address of preserved memory) via 'data_handle' that LUO will save.
 * Return: 0 on success.
 */
int (*global_state_create)(struct liveupdate_file_handler *h,
                           void **obj, u64 *data_handle);

/*
 * Optional. Called by LUO in the new kernel
 * before the first access to the global state. The handler receives
 * the preserved u64 data_handle and should use it to reconstruct its
 * global state object, returning a pointer to it via 'obj'.
 * Return: 0 on success.
 */
int (*global_state_restore)(struct liveupdate_file_handler *h,
                            u64 data_handle, void **obj);

/*
 * Optional. Called by LUO after the last
 * file for this handler is unpreserved or finished. The handler
 * must free its global state object and any associated resources.
 */
void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
Accessors to get/put the global state data:
/* Get and lock the data with file_handler scoped lock */
int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
void **obj);
/* Unlock the data */
void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
Execution Flow:
Outgoing Kernel (First preserve() call):
1. Handler's preserve() is called. It needs the global state, so it calls
   liveupdate_fh_global_state_get(h, &obj). LUO acquires h->global_state_lock.
   It sees h->global_state_obj is NULL.
   LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
   The handler allocates its state, preserves it with KHO, and returns its live
   pointer and a u64 handle.
2. LUO stores the handle internally for later serialization.
3. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
4. The preserve() callback does its work using the obj.
5. It calls liveupdate_fh_global_state_put(h), which releases the lock.
Global PREPARE:
1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
the LUO FDT.
Incoming Kernel (First access):
1. liveupdate_fh_global_state_get(h, &obj) is called for the first time. LUO
   acquires h->global_state_lock.
2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
   handle from the FDT. LUO calls h->ops->global_state_restore().
3. The handler reconstructs its state object and returns the live pointer.
4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
5. The caller does its work.
6. It calls liveupdate_fh_global_state_put(h) to release the lock.
Last File Cleanup (in unpreserve or finish):
1. LUO decrements h->count to 0.
2. This triggers the cleanup logic.
3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
4. The handler frees its memory and resources.
5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
cycle.
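The flow above can be modelled in plain userspace C to check that the create, restore and destroy paths fit together. This is only a sketch under stated assumptions: struct fh, fh_global_state_get(), fh_global_state_put(), fh_file_released() and the demo_* handler are hypothetical stand-ins, an int flag replaces h->global_state_lock, and a pointer cast replaces the KHO-preserved physical address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct fh;	/* hypothetical stand-in for struct liveupdate_file_handler */

struct fh_ops {
	int (*global_state_create)(struct fh *h, void **obj, uint64_t *data_handle);
	int (*global_state_restore)(struct fh *h, uint64_t data_handle, void **obj);
	void (*global_state_destroy)(struct fh *h, void *obj);
};

struct fh {
	const struct fh_ops *ops;
	void *global_state_obj;	/* live object, NULL until the first get */
	uint64_t data_handle;	/* value LUO would serialize */
	int has_handle;		/* handle was inherited from the old kernel */
	int locked;		/* stand-in for h->global_state_lock */
	long count;		/* number of preserved files */
};

/* Get and "lock" the state, creating or restoring it on first access. */
static int fh_global_state_get(struct fh *h, void **obj)
{
	int err = 0;

	assert(!h->locked);
	h->locked = 1;
	if (!h->global_state_obj) {
		if (h->has_handle)
			err = h->ops->global_state_restore(h, h->data_handle,
							   &h->global_state_obj);
		else
			err = h->ops->global_state_create(h, &h->global_state_obj,
							  &h->data_handle);
	}
	if (err) {
		h->locked = 0;	/* never return an error with the lock held */
		return err;
	}
	*obj = h->global_state_obj;
	return 0;
}

static void fh_global_state_put(struct fh *h)
{
	h->locked = 0;
}

/* Called when one file of this handler is unpreserved or finished. */
static void fh_file_released(struct fh *h)
{
	if (--h->count == 0 && h->global_state_obj) {
		h->ops->global_state_destroy(h, h->global_state_obj);
		h->global_state_obj = NULL;	/* ready for the next cycle */
		h->has_handle = 0;
	}
}

/* Demo handler: the global state is a single long. */
static int demo_create(struct fh *h, void **obj, uint64_t *data_handle)
{
	long *state = calloc(1, sizeof(*state));

	(void)h;
	if (!state)
		return -1;
	*obj = state;
	*data_handle = (uint64_t)(uintptr_t)state; /* kernel: preserved address */
	return 0;
}

static int demo_restore(struct fh *h, uint64_t data_handle, void **obj)
{
	(void)h;
	*obj = (void *)(uintptr_t)data_handle;
	return 0;
}

static void demo_destroy(struct fh *h, void *obj)
{
	(void)h;
	free(obj);
}

static const struct fh_ops demo_ops = {
	.global_state_create = demo_create,
	.global_state_restore = demo_restore,
	.global_state_destroy = demo_destroy,
};
```

One detail the model makes explicit is the error path in get: if create or restore fails, the lock has to be dropped before returning, otherwise the caller has no matching put to call.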
Pasha
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-07 13:30 UTC
To: Pasha Tatashin
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CA+CK2bCFsPZQQQ0JFErnYt=dbzBx=ZJdV+eNXYWyNUE+xk7=yA@mail.gmail.com>
On Tue, Oct 07 2025, Pasha Tatashin wrote:
> On Tue, Oct 7, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Mon, Oct 06 2025, Pasha Tatashin wrote:
>>
>> > On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >>
>> >> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>> >>
>> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >> >
>> >> > The KHO framework uses a notifier chain as the mechanism for clients to
>> >> > participate in the finalization process. While this works for a single,
>> >> > central state machine, it is too restrictive for kernel-internal
>> >> > components like pstore/reserve_mem or IMA. These components need a
>> >> > simpler, direct way to register their state for preservation (e.g.,
>> >> > during their initcall) without being part of a complex,
>> >> > shutdown-time notifier sequence. The notifier model forces all
>> >> > participants into a single finalization flow and makes direct
>> >> > preservation from an arbitrary context difficult.
>> >> > This patch refactors the client participation model by removing the
>> >> > notifier chain and introducing a direct API for managing FDT subtrees.
>> >> >
>> >> > The core kho_finalize() and kho_abort() state machine remains, but
>> >> > clients now register their data with KHO beforehand.
>> >> >
>> >> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> >> [...]
>> >> > diff --git a/mm/memblock.c b/mm/memblock.c
>> >> > index e23e16618e9b..c4b2d4e4c715 100644
>> >> > --- a/mm/memblock.c
>> >> > +++ b/mm/memblock.c
>> >> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
>> >> > #define MEMBLOCK_KHO_FDT "memblock"
>> >> > #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
>> >> > #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
>> >> > -static struct page *kho_fdt;
>> >> > -
>> >> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
>> >> > -{
>> >> > - int err = 0, i;
>> >> > -
>> >> > - for (i = 0; i < reserved_mem_count; i++) {
>> >> > - struct reserve_mem_table *map = &reserved_mem_table[i];
>> >> > - struct page *page = phys_to_page(map->start);
>> >> > - unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >> > -
>> >> > - err |= kho_preserve_pages(page, nr_pages);
>> >> > - }
>> >> > -
>> >> > - err |= kho_preserve_folio(page_folio(kho_fdt));
>> >> > - err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
>> >> > -
>> >> > - return notifier_from_errno(err);
>> >> > -}
>> >> > -
>> >> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
>> >> > - unsigned long cmd, void *v)
>> >> > -{
>> >> > - switch (cmd) {
>> >> > - case KEXEC_KHO_FINALIZE:
>> >> > - return reserve_mem_kho_finalize((struct kho_serialization *)v);
>> >> > - case KEXEC_KHO_ABORT:
>> >> > - return NOTIFY_DONE;
>> >> > - default:
>> >> > - return NOTIFY_BAD;
>> >> > - }
>> >> > -}
>> >> > -
>> >> > -static struct notifier_block reserve_mem_kho_nb = {
>> >> > - .notifier_call = reserve_mem_kho_notifier,
>> >> > -};
>> >> >
>> >> > static int __init prepare_kho_fdt(void)
>> >> > {
>> >> > int err = 0, i;
>> >> > + struct page *fdt_page;
>> >> > void *fdt;
>> >> >
>> >> > - kho_fdt = alloc_page(GFP_KERNEL);
>> >> > - if (!kho_fdt)
>> >> > + fdt_page = alloc_page(GFP_KERNEL);
>> >> > + if (!fdt_page)
>> >> > return -ENOMEM;
>> >> >
>> >> > - fdt = page_to_virt(kho_fdt);
>> >> > + fdt = page_to_virt(fdt_page);
>> >> >
>> >> > err |= fdt_create(fdt, PAGE_SIZE);
>> >> > err |= fdt_finish_reservemap(fdt);
>> >> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
>> >> > err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
>> >> > for (i = 0; i < reserved_mem_count; i++) {
>> >> > struct reserve_mem_table *map = &reserved_mem_table[i];
>> >> > + struct page *page = phys_to_page(map->start);
>> >> > + unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >> >
>> >> > + err |= kho_preserve_pages(page, nr_pages);
>> >> > err |= fdt_begin_node(fdt, map->name);
>> >> > err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
>> >> > err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
>> >> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
>> >> > err |= fdt_end_node(fdt);
>> >> > }
>> >> > err |= fdt_end_node(fdt);
>> >> > -
>> >> > err |= fdt_finish(fdt);
>> >> >
>> >> > + err |= kho_preserve_folio(page_folio(fdt_page));
>> >> > + err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
>> >> > +
>> >> > if (err) {
>> >> > pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
>> >> > - put_page(kho_fdt);
>> >> > - kho_fdt = NULL;
>> >> > + put_page(fdt_page);
>> >>
>> >> This adds subtree to KHO even if the FDT might be invalid. And then
>> >> leaves a dangling reference in KHO to the FDT in case of an error. I
>> >> think you should either do this check after
>> >> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
>> >> kho_add_subtree(), or call kho_remove_subtree() in the error block.
>> >
>> > I agree, I do not like these err |= stuff, we should be checking
>> > errors cleanly, and do proper clean-ups.
>>
>> Yeah, this is mainly a byproduct of using FDTs. Getting and setting
>> simple properties also needs error checking and that can get tedious
>> real quick. Which is why this pattern has shown up I suppose.
>
> Exactly. This is also why it's important to replace FDT with something
> more sensible for general-purpose live update purposes.
>
> By the way, I forgot to address this comment in the v5 of the KHO
> series I sent out yesterday. Could you please take another look? If
> everything else is good, I will refresh that series so we can ask
> Andrew to take in the KHO patches. That would simplify the LUO series.
Good idea. Will take a look.
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-10-07 13:16 UTC
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0playoqui.fsf@kernel.org>
On Tue, Oct 7, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Oct 06 2025, Pasha Tatashin wrote:
>
> > On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >>
> >> On Mon, Sep 29 2025, Pasha Tatashin wrote:
> >>
> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> > The KHO framework uses a notifier chain as the mechanism for clients to
> >> > participate in the finalization process. While this works for a single,
> >> > central state machine, it is too restrictive for kernel-internal
> >> > components like pstore/reserve_mem or IMA. These components need a
> >> > simpler, direct way to register their state for preservation (e.g.,
> >> > during their initcall) without being part of a complex,
> >> > shutdown-time notifier sequence. The notifier model forces all
> >> > participants into a single finalization flow and makes direct
> >> > preservation from an arbitrary context difficult.
> >> > This patch refactors the client participation model by removing the
> >> > notifier chain and introducing a direct API for managing FDT subtrees.
> >> >
> >> > The core kho_finalize() and kho_abort() state machine remains, but
> >> > clients now register their data with KHO beforehand.
> >> >
> >> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >> [...]
> >> > diff --git a/mm/memblock.c b/mm/memblock.c
> >> > index e23e16618e9b..c4b2d4e4c715 100644
> >> > --- a/mm/memblock.c
> >> > +++ b/mm/memblock.c
> >> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
> >> > #define MEMBLOCK_KHO_FDT "memblock"
> >> > #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
> >> > #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> >> > -static struct page *kho_fdt;
> >> > -
> >> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> >> > -{
> >> > - int err = 0, i;
> >> > -
> >> > - for (i = 0; i < reserved_mem_count; i++) {
> >> > - struct reserve_mem_table *map = &reserved_mem_table[i];
> >> > - struct page *page = phys_to_page(map->start);
> >> > - unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >> > -
> >> > - err |= kho_preserve_pages(page, nr_pages);
> >> > - }
> >> > -
> >> > - err |= kho_preserve_folio(page_folio(kho_fdt));
> >> > - err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> >> > -
> >> > - return notifier_from_errno(err);
> >> > -}
> >> > -
> >> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
> >> > - unsigned long cmd, void *v)
> >> > -{
> >> > - switch (cmd) {
> >> > - case KEXEC_KHO_FINALIZE:
> >> > - return reserve_mem_kho_finalize((struct kho_serialization *)v);
> >> > - case KEXEC_KHO_ABORT:
> >> > - return NOTIFY_DONE;
> >> > - default:
> >> > - return NOTIFY_BAD;
> >> > - }
> >> > -}
> >> > -
> >> > -static struct notifier_block reserve_mem_kho_nb = {
> >> > - .notifier_call = reserve_mem_kho_notifier,
> >> > -};
> >> >
> >> > static int __init prepare_kho_fdt(void)
> >> > {
> >> > int err = 0, i;
> >> > + struct page *fdt_page;
> >> > void *fdt;
> >> >
> >> > - kho_fdt = alloc_page(GFP_KERNEL);
> >> > - if (!kho_fdt)
> >> > + fdt_page = alloc_page(GFP_KERNEL);
> >> > + if (!fdt_page)
> >> > return -ENOMEM;
> >> >
> >> > - fdt = page_to_virt(kho_fdt);
> >> > + fdt = page_to_virt(fdt_page);
> >> >
> >> > err |= fdt_create(fdt, PAGE_SIZE);
> >> > err |= fdt_finish_reservemap(fdt);
> >> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
> >> > err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
> >> > for (i = 0; i < reserved_mem_count; i++) {
> >> > struct reserve_mem_table *map = &reserved_mem_table[i];
> >> > + struct page *page = phys_to_page(map->start);
> >> > + unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >> >
> >> > + err |= kho_preserve_pages(page, nr_pages);
> >> > err |= fdt_begin_node(fdt, map->name);
> >> > err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
> >> > err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> >> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
> >> > err |= fdt_end_node(fdt);
> >> > }
> >> > err |= fdt_end_node(fdt);
> >> > -
> >> > err |= fdt_finish(fdt);
> >> >
> >> > + err |= kho_preserve_folio(page_folio(fdt_page));
> >> > + err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> >> > +
> >> > if (err) {
> >> > pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> >> > - put_page(kho_fdt);
> >> > - kho_fdt = NULL;
> >> > + put_page(fdt_page);
> >>
> >> This adds subtree to KHO even if the FDT might be invalid. And then
> >> leaves a dangling reference in KHO to the FDT in case of an error. I
> >> think you should either do this check after
> >> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
> >> kho_add_subtree(), or call kho_remove_subtree() in the error block.
> >
> > I agree, I do not like these err |= stuff, we should be checking
> > errors cleanly, and do proper clean-ups.
>
> Yeah, this is mainly a byproduct of using FDTs. Getting and setting
> simple properties also needs error checking and that can get tedious
> real quick. Which is why this pattern has shown up I suppose.
Exactly. This is also why it's important to replace FDT with something
more sensible for general-purpose live update purposes.
By the way, I forgot to address this comment in the v5 of the KHO
series I sent out yesterday. Could you please take another look? If
everything else is good, I will refresh that series so we can ask
Andrew to take in the KHO patches. That would simplify the LUO series.
Pasha
>
> >
> >> I prefer the former since if kho_add_subtree() is the one that fails,
> >> there is little sense in removing a subtree that was never added.
> >>
> >> > }
> >> >
> >> > return err;
> >> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
> >> > err = prepare_kho_fdt();
> >> > if (err)
> >> > return err;
> >> > -
> >> > - err = register_kho_notifier(&reserve_mem_kho_nb);
> >> > - if (err) {
> >> > - put_page(kho_fdt);
> >> > - kho_fdt = NULL;
> >> > - }
> >> > -
> >> > return err;
> >> > }
> >> > late_initcall(reserve_mem_init);
> >>
> >> --
> >> Regards,
> >> Pratyush Yadav
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-07 12:09 UTC
To: Pasha Tatashin
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CA+CK2bA2qfLF1Mbyvnat+L9+5KAw6LnhYETXVoYcMGJxwTGahg@mail.gmail.com>
On Mon, Oct 06 2025, Pasha Tatashin wrote:
> On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>>
>> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> > The KHO framework uses a notifier chain as the mechanism for clients to
>> > participate in the finalization process. While this works for a single,
>> > central state machine, it is too restrictive for kernel-internal
>> > components like pstore/reserve_mem or IMA. These components need a
>> > simpler, direct way to register their state for preservation (e.g.,
>> > during their initcall) without being part of a complex,
>> > shutdown-time notifier sequence. The notifier model forces all
>> > participants into a single finalization flow and makes direct
>> > preservation from an arbitrary context difficult.
>> > This patch refactors the client participation model by removing the
>> > notifier chain and introducing a direct API for managing FDT subtrees.
>> >
>> > The core kho_finalize() and kho_abort() state machine remains, but
>> > clients now register their data with KHO beforehand.
>> >
>> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> [...]
>> > diff --git a/mm/memblock.c b/mm/memblock.c
>> > index e23e16618e9b..c4b2d4e4c715 100644
>> > --- a/mm/memblock.c
>> > +++ b/mm/memblock.c
>> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
>> > #define MEMBLOCK_KHO_FDT "memblock"
>> > #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
>> > #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
>> > -static struct page *kho_fdt;
>> > -
>> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
>> > -{
>> > - int err = 0, i;
>> > -
>> > - for (i = 0; i < reserved_mem_count; i++) {
>> > - struct reserve_mem_table *map = &reserved_mem_table[i];
>> > - struct page *page = phys_to_page(map->start);
>> > - unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> > -
>> > - err |= kho_preserve_pages(page, nr_pages);
>> > - }
>> > -
>> > - err |= kho_preserve_folio(page_folio(kho_fdt));
>> > - err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
>> > -
>> > - return notifier_from_errno(err);
>> > -}
>> > -
>> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
>> > - unsigned long cmd, void *v)
>> > -{
>> > - switch (cmd) {
>> > - case KEXEC_KHO_FINALIZE:
>> > - return reserve_mem_kho_finalize((struct kho_serialization *)v);
>> > - case KEXEC_KHO_ABORT:
>> > - return NOTIFY_DONE;
>> > - default:
>> > - return NOTIFY_BAD;
>> > - }
>> > -}
>> > -
>> > -static struct notifier_block reserve_mem_kho_nb = {
>> > - .notifier_call = reserve_mem_kho_notifier,
>> > -};
>> >
>> > static int __init prepare_kho_fdt(void)
>> > {
>> > int err = 0, i;
>> > + struct page *fdt_page;
>> > void *fdt;
>> >
>> > - kho_fdt = alloc_page(GFP_KERNEL);
>> > - if (!kho_fdt)
>> > + fdt_page = alloc_page(GFP_KERNEL);
>> > + if (!fdt_page)
>> > return -ENOMEM;
>> >
>> > - fdt = page_to_virt(kho_fdt);
>> > + fdt = page_to_virt(fdt_page);
>> >
>> > err |= fdt_create(fdt, PAGE_SIZE);
>> > err |= fdt_finish_reservemap(fdt);
>> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
>> > err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
>> > for (i = 0; i < reserved_mem_count; i++) {
>> > struct reserve_mem_table *map = &reserved_mem_table[i];
>> > + struct page *page = phys_to_page(map->start);
>> > + unsigned int nr_pages = map->size >> PAGE_SHIFT;
>> >
>> > + err |= kho_preserve_pages(page, nr_pages);
>> > err |= fdt_begin_node(fdt, map->name);
>> > err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
>> > err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
>> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
>> > err |= fdt_end_node(fdt);
>> > }
>> > err |= fdt_end_node(fdt);
>> > -
>> > err |= fdt_finish(fdt);
>> >
>> > + err |= kho_preserve_folio(page_folio(fdt_page));
>> > + err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
>> > +
>> > if (err) {
>> > pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
>> > - put_page(kho_fdt);
>> > - kho_fdt = NULL;
>> > + put_page(fdt_page);
>>
>> This adds subtree to KHO even if the FDT might be invalid. And then
>> leaves a dangling reference in KHO to the FDT in case of an error. I
>> think you should either do this check after
>> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
>> kho_add_subtree(), or call kho_remove_subtree() in the error block.
>
> I agree, I do not like these err |= stuff, we should be checking
> errors cleanly, and do proper clean-ups.
Yeah, this is mainly a byproduct of using FDTs. Getting and setting
simple properties also needs error checking and that can get tedious
real quick. Which is why this pattern has shown up I suppose.
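For illustration, the "check after preserving the folio, then check kho_add_subtree() on its own" ordering could look roughly like the sketch below. It is a userspace model, not the memblock code: struct step_results injects the result of each stage, struct cleanup_log records what the unwind path did, and prepare_fdt_checked() is a hypothetical name. The point is only the control flow: the subtree is registered last, and each label undoes exactly what succeeded before it.

```c
#include <assert.h>
#include <errno.h>

/* Injected results of each stage; 0 means success, negative means failure. */
struct step_results {
	int build_fdt;		/* fdt_create() ... fdt_finish() */
	int preserve_folio;	/* kho_preserve_folio() */
	int add_subtree;	/* kho_add_subtree() */
};

/* Records which actions the function took, for inspection by the caller. */
struct cleanup_log {
	int folio_unpreserved;
	int page_freed;
	int subtree_added;
};

/*
 * Check every stage separately instead of OR-ing the results together,
 * and unwind in reverse order. The subtree is added only after the FDT
 * is known to be valid, so no reference to a freed page is left behind.
 */
static int prepare_fdt_checked(const struct step_results *r,
			       struct cleanup_log *log)
{
	int err;

	err = r->build_fdt;
	if (err)
		goto err_free_page;

	err = r->preserve_folio;
	if (err)
		goto err_free_page;

	err = r->add_subtree;
	if (err)
		goto err_unpreserve;

	log->subtree_added = 1;
	return 0;

err_unpreserve:
	log->folio_unpreserved = 1;	/* undo the folio preservation */
err_free_page:
	log->page_freed = 1;		/* put_page() in the real code */
	return err;
}
```

Separate checks also keep the returned value meaningful, since OR-ing several negative return codes together does not yield any one of them.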
>
>> I prefer the former since if kho_add_subtree() is the one that fails,
>> there is little sense in removing a subtree that was never added.
>>
>> > }
>> >
>> > return err;
>> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
>> > err = prepare_kho_fdt();
>> > if (err)
>> > return err;
>> > -
>> > - err = register_kho_notifier(&reserve_mem_kho_nb);
>> > - if (err) {
>> > - put_page(kho_fdt);
>> > - kho_fdt = NULL;
>> > - }
>> > -
>> > return err;
>> > }
>> > late_initcall(reserve_mem_init);
>>
>> --
>> Regards,
>> Pratyush Yadav
--
Regards,
Pratyush Yadav
* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Christian Brauner @ 2025-10-07 11:00 UTC
To: Andrey Albershteyn
Cc: Jan Kara, Jiri Slaby, Amir Goldstein, Arnd Bergmann,
Casey Schaufler, Pali Rohár, Paul Moore, linux-api,
linux-fsdevel, linux-kernel, linux-xfs, selinux,
Andrey Albershteyn
In-Reply-To: <eyl6bzyi33tn6uys2ba5xjluvw7yjempqnla3jaih76mtgxgxq@i6xe2nquwqaf>
On Mon, Oct 06, 2025 at 08:52:32PM +0200, Andrey Albershteyn wrote:
> On 2025-10-06 17:39:46, Jan Kara wrote:
> > On Mon 06-10-25 13:09:05, Jiri Slaby wrote:
> > > On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> > > > Future patches will add new syscalls which use these functions. As
> > > > this interface won't be used for ioctls only, the EOPNOSUPP is more
> > > > appropriate return code.
> > > >
> > > > This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> > > > vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> > > > EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
> > > >
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > > ...
> > > > @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> > > > fileattr_fill_flags(&fa, flags);
> > > > err = vfs_fileattr_set(idmap, dentry, &fa);
> > > > mnt_drop_write_file(file);
> > > > + if (err == -EOPNOTSUPP)
> > > > + err = -ENOIOCTLCMD;
> > >
> > > This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
> > > ENOIOCTLCMD/ENOTTY:
> > > https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
> > >
> > > I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used to
> > > return EOPNOTSUPP.
> > >
> > > This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
> > > &FS_NODUMP_FL):
> > > https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
> > >
> > > dumps in 6.16:
> > > sf: ioctl: Operation not supported
> > >
> > > with the above patch:
> > > sf: ioctl: Inappropriate ioctl for device
> > >
> > > Is this expected?
Nope, unintentional regression as Arnd noted.
> >
> > No, that's a bug and a clear userspace regression so we need to fix it. I
> > think we need to revert this commit and instead convert ENOIOCTLCMD from
> > vfs_fileattr_get/set() to EOPNOTSUPP in appropriate places. Andrey?
>
> I will prepare a patch soon
Thanks!
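
For readers following along, the direction Jan suggests could look roughly like the sketch below. It is a userspace model, not the eventual patch. It assumes the regression comes from a filesystem's own -EOPNOTSUPP being caught by the new translation in the ioctl path. The names toy_fileattr_set(), toy_ioctl_path() and toy_syscall_path() are made up for illustration, and ENOIOCTLCMD is defined locally because it is a kernel-internal code that userspace headers do not export.

```c
#include <errno.h>

/* Kernel-internal errno value; defined here only for the model. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/* Hypothetical filesystem behaviours the model distinguishes. */
enum toy_fs {
	TOY_FS_OK,		/* handler exists and accepts the flags */
	TOY_FS_NO_HANDLER,	/* filesystem has no fileattr handler */
	TOY_FS_REJECTS,		/* handler exists but refuses the flags */
};

/* Stand-in for vfs_fileattr_set() after a revert: keeps -ENOIOCTLCMD. */
int toy_fileattr_set(enum toy_fs fs)
{
	switch (fs) {
	case TOY_FS_NO_HANDLER:
		return -ENOIOCTLCMD;
	case TOY_FS_REJECTS:
		return -EOPNOTSUPP;
	default:
		return 0;
	}
}

/* ioctl path: no translation, so a filesystem's -EOPNOTSUPP survives. */
int toy_ioctl_path(enum toy_fs fs)
{
	return toy_fileattr_set(fs);
}

/* Non-ioctl path (new syscalls): translate only at this boundary. */
int toy_syscall_path(enum toy_fs fs)
{
	int err = toy_fileattr_set(fs);

	if (err == -ENOIOCTLCMD)
		err = -EOPNOTSUPP;
	return err;
}
```

In this model the ioctl caller sees the same codes as before, while only the non-ioctl caller ever receives the translated value.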
* Re: [PATCH 1/7] tools/nolibc: remove __nolibc_enosys() fallback from time64-related functions
From: Arnd Bergmann @ 2025-10-07 10:03 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Willy Tarreau, Catalin Marinas, Will Deacon, shuah, Mark Brown,
linux-kernel, linux-arm-kernel, linux-kselftest, Linux-Arch,
linux-api
In-Reply-To: <a5b8344e-8214-4946-8344-f34e969d30b2@t-8ch.de>
On Mon, Oct 6, 2025, at 22:14, Thomas Weißschuh wrote:
> On 2025-10-01 09:43:37+0200, Arnd Bergmann wrote:
>> On Thu, Aug 21, 2025, at 17:40, Thomas Weißschuh wrote:
>> - the old types are often too short, both for the y2038
>> overflow and for the file system types.
>
> So far this was not something we actively tried to support,
> especially with the restriction mentioned below.
>
>> I suspect the problem is that the kernel's uapi/linux/time.h
>> still defines the old types as the default, and nolibc
>> historically just picks it up from there.
>
> So far we have tried to keep nolibc compatible with the kernel UAPI when
> included in any order. This forced us to use 'struct timespec' from
> uapi/linux/time.h. With the upcoming implementation of signals in nolibc
> this guideline is relaxed a bit, so we should be able to use our own
> always-64-bit 'struct timespec'.
You can probably either "#define timespec __kernel_timespec" or
"#define __kernel_timespec timespec" before including
linux/time_types.h.
Note that there is no time64 variant of "struct timeval", so
any syscall that needs this has to be implemented in userspace
as a wrapper around the timespec based one, e.g. gettimeofday()
needs to call clock_gettime() on all 32-bit systems.
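
A minimal sketch of such a wrapper, written against ordinary libc types rather than nolibc internals, might look like this. The function name is invented for the example, and a real nolibc version would presumably use its own 64-bit timespec instead of the libc one.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical wrapper: fill *tv from CLOCK_REALTIME instead of
 * relying on a timeval-based syscall.
 */
int gettimeofday_via_clock_gettime(struct timeval *tv)
{
	struct timespec ts;

	if (tv == NULL)
		return 0;	/* nothing requested */

	if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
		return -1;	/* errno was set by clock_gettime() */

	tv->tv_sec = ts.tv_sec;
	tv->tv_usec = ts.tv_nsec / 1000;	/* ns to us, truncating */
	return 0;
}
```

The conversion only loses sub-microsecond precision, which struct timeval cannot represent anyway.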
>> We should also consider dropping the
>> legacy type definitions from uapi/linux/time.h and
>> require each libc to define their own.
>
> Can we even just drop them? Or should they also get some backwards
> compat guards?
This is the big question, and we kind of left this one open
to be decided later when we finished the actual binary interface.
I think simply dropping the old definition is one of several
options we have, because that does not change the ABI in an
incompatible way and just requires the few user space sources
that use this to either require old kernel headers or make
simple source-level changes that they should have done for
portability anyway.
I see multiple decisions we have to make in that option space
once we decide to do anything:
- do we change the headers for both 32-bit and 64-bit userspace
for consistency, or only for 32-bit userspace to limit
the impact to those users that care about 32-bit?
- do we remove only the type definitions (timespec, timeval,
itimerspec, itimerval, timex) or also the syscall macros
for the time32 syscalls using them?
- What method (if any) would be used to choose between the
time32 definitions, the time64 definitions or none of them
when including the kernel headers?
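
Purely to illustrate the last question, one conceivable shape is a selection macro that userspace sets before including the headers. EXAMPLE_UAPI_TIME_BITS and struct example_timespec are invented for this sketch and are not existing kernel or libc names. Nothing here is a proposal for the actual interface.

```c
#include <stdint.h>

/* Made-up knob: 64 (default), 32, or 0 for "define nothing". */
#ifndef EXAMPLE_UAPI_TIME_BITS
#define EXAMPLE_UAPI_TIME_BITS 64
#endif

#if EXAMPLE_UAPI_TIME_BITS == 64
/* time64 layout: both fields are 64 bits wide. */
struct example_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};
#elif EXAMPLE_UAPI_TIME_BITS == 32
/* legacy time32 layout: overflows in 2038. */
struct example_timespec {
	int32_t tv_sec;
	int32_t tv_nsec;
};
#elif EXAMPLE_UAPI_TIME_BITS == 0
/* No definition at all: the libc has to bring its own type. */
#else
#error "EXAMPLE_UAPI_TIME_BITS must be 64, 32 or 0"
#endif
```

With the default of 64, sizeof(struct example_timespec) is 16 bytes regardless of the word size of the target.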
Arnd
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Christoph Hellwig @ 2025-10-07 5:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <aOLr8M6s1W2qC5-Q@dread.disaster.area>
On Mon, Oct 06, 2025 at 09:06:40AM +1100, Dave Chinner wrote:
> If you don't care about accurate c/mtime, then mount the filesystem
> with '-o lazytime' to degrade c/mtime updates to "eventual
> consistency" behaviour for IO operations.
Exactly.
> Lazytime updates can generally be done in a non-blocking manner
> right now (someone raised that in the context of io-uring on #xfs
> about a month ago), but the NOWAIT behaviour for timestamp updates
> is done at a higher level in the VFS and does not take into account
> filesystem specific non-blocking lazytime updates at all. If we
> push the NOWAIT checking behaviour down to the filesystem, we can do
> this.
We might not even have to push it out, but just make the VFS/rw helper
check aware of lazytime. Either way currently even a lazytime
timestamp update will cause a write to block, which renders the
nowait writes pretty useless on anything but block devices.
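
As a toy model of the difference (not kernel code), the check could be thought of as below. The struct and both function names are hypothetical. Modelling the failure as -EAGAIN is an assumption of this sketch, as is the premise that a lazytime update only touches in-memory state and therefore does not need to block.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the rw helper knows about a write. */
struct toy_write_ctx {
	bool nowait;		/* caller asked for a non-blocking write */
	bool needs_time_update;	/* c/mtime would change */
	bool lazytime;		/* mounted with -o lazytime */
};

/* Behaviour described above: any timestamp update defeats nowait. */
int toy_nowait_check_current(const struct toy_write_ctx *c)
{
	if (c->nowait && c->needs_time_update)
		return -EAGAIN;
	return 0;
}

/* Lazytime-aware variant: only a non-lazy update defeats nowait. */
int toy_nowait_check_lazytime_aware(const struct toy_write_ctx *c)
{
	if (c->nowait && c->needs_time_update && !c->lazytime)
		return -EAGAIN;
	return 0;
}
```

The only input for which the two checks disagree is a nowait write that needs a timestamp update on a lazytime mount.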
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Christoph Hellwig @ 2025-10-07 5:08 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <CALCETrVsD6Z42gO7S-oAbweN5OwV1OLqxztBkB58goSzccSZKw@mail.gmail.com>
On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > Well, we'll need to look into that, including maybe non-blocking
> > timestamp updates.
> >
>
> It's been 12 years (!), but maybe it's time to reconsider this:
>
> https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
I don't see how that is relevant here. Also writes through shared
mmaps are problematic for so many reasons that I'm not sure we want
to encourage people to use that more.
* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Andrey Albershteyn @ 2025-10-06 18:52 UTC (permalink / raw)
To: Jan Kara
Cc: Jiri Slaby, Amir Goldstein, Arnd Bergmann, Casey Schaufler,
Christian Brauner, Pali Rohár, Paul Moore, linux-api,
linux-fsdevel, linux-kernel, linux-xfs, selinux,
Andrey Albershteyn
In-Reply-To: <jp3vopwtpik7bj77aejuknaziecuml6x2l2dr3oe2xoats6tls@yskzvehakmkv>
On 2025-10-06 17:39:46, Jan Kara wrote:
> On Mon 06-10-25 13:09:05, Jiri Slaby wrote:
> > On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> > > Future patches will add new syscalls which use these functions. As
> > > this interface won't be used for ioctls only, the EOPNOSUPP is more
> > > appropriate return code.
> > >
> > > This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> > > vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> > > EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
> > >
> > > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > ...
> > > @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> > > fileattr_fill_flags(&fa, flags);
> > > err = vfs_fileattr_set(idmap, dentry, &fa);
> > > mnt_drop_write_file(file);
> > > + if (err == -EOPNOTSUPP)
> > > + err = -ENOIOCTLCMD;
> >
> > This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
> > ENOIOCTLCMD/ENOTTY:
> > https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
> >
> > I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used to
> > return EOPNOTSUPP.
> >
> > This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
> > &FS_NODUMP_FL):
> > https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
> >
> > dumps in 6.16:
> > sf: ioctl: Operation not supported
> >
> > with the above patch:
> > sf: ioctl: Inappropriate ioctl for device
> >
> > Is this expected?
>
> No, that's a bug and a clear userspace regression so we need to fix it. I
> think we need to revert this commit and instead convert ENOIOCTLCMD from
> vfs_fileattr_get/set() to EOPNOTSUPP in appropriate places. Andrey?
I will prepare a patch soon
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
- Andrey
* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-10-06 18:02 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs07bx8ouva.fsf@kernel.org>
On Mon, Oct 6, 2025 at 12:31 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Currently, KHO is controlled via debugfs interface, but once LUO is
> > introduced, it can control KHO, and the debug interface becomes
> > optional.
> >
> > Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> > the debugfs interface, and allows to inspect the tree.
> >
> > Move all debugfs related code to a new file to keep the .c files
> > clear of ifdefs.
> >
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> > MAINTAINERS | 3 +-
> > kernel/Kconfig.kexec | 10 ++
> > kernel/Makefile | 1 +
> > kernel/kexec_handover.c | 255 +++++--------------------------
> > kernel/kexec_handover_debug.c | 218 ++++++++++++++++++++++++++
> > kernel/kexec_handover_internal.h | 44 ++++++
> > 6 files changed, 311 insertions(+), 220 deletions(-)
> > create mode 100644 kernel/kexec_handover_debug.c
> > create mode 100644 kernel/kexec_handover_internal.h
> >
> [...]
> > --- a/kernel/Kconfig.kexec
> > +++ b/kernel/Kconfig.kexec
> > @@ -109,6 +109,16 @@ config KEXEC_HANDOVER
> > to keep data or state alive across the kexec. For this to work,
> > both source and target kernels need to have this option enabled.
> >
> > +config KEXEC_HANDOVER_DEBUG
>
> Nit: can we call it KEXEC_HANDOVER_DEBUGFS instead? I think we would
> like to add a KEXEC_HANDOVER_DEBUG at some point to control debug
> asserts for KHO, and the naming would get confusing. And renaming config
> symbols is kind of a pain.
Done.
>
> > + bool "kexec handover debug interface"
> > + depends on KEXEC_HANDOVER
> > + depends on DEBUG_FS
> > + help
> > + Allow to control kexec handover device tree via debugfs
> > + interface, i.e. finalize the state or aborting the finalization.
> > + Also, enables inspecting the KHO fdt trees with the debugfs binary
> > + blobs.
> > +
> [...]
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-10-06 17:23 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0y0ponf6b.fsf@kernel.org>
On Mon, Oct 6, 2025 at 12:55 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Currently, KHO is controlled via debugfs interface, but once LUO is
> > introduced, it can control KHO, and the debug interface becomes
> > optional.
> >
> > Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> > the debugfs interface, and allows to inspect the tree.
> >
> > Move all debugfs related code to a new file to keep the .c files
> > clear of ifdefs.
> >
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > @@ -662,36 +660,24 @@ static void __init kho_reserve_scratch(void)
> > kho_enable = false;
> > }
> >
> > -struct fdt_debugfs {
> > - struct list_head list;
> > - struct debugfs_blob_wrapper wrapper;
> > - struct dentry *file;
> > +struct kho_out {
> > + struct blocking_notifier_head chain_head;
> > + struct mutex lock; /* protects KHO FDT finalization */
> > + struct kho_serialization ser;
> > + bool finalized;
> > + struct kho_debugfs dbg;
> > };
> >
> > -static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
> > - const char *name, const void *fdt)
> > -{
> > - struct fdt_debugfs *f;
> > - struct dentry *file;
> > -
> > - f = kmalloc(sizeof(*f), GFP_KERNEL);
> > - if (!f)
> > - return -ENOMEM;
> > -
> > - f->wrapper.data = (void *)fdt;
> > - f->wrapper.size = fdt_totalsize(fdt);
> > -
> > - file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
> > - if (IS_ERR(file)) {
> > - kfree(f);
> > - return PTR_ERR(file);
> > - }
> > -
> > - f->file = file;
> > - list_add(&f->list, list);
> > -
> > - return 0;
> > -}
> > +static struct kho_out kho_out = {
> > + .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
> > + .lock = __MUTEX_INITIALIZER(kho_out.lock),
> > + .ser = {
> > + .track = {
> > + .orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
> > + },
> > + },
> > + .finalized = false,
> > +};
>
> There is already one definition for struct kho_out and a static struct
> kho_out early in the file. This is a second declaration and definition.
> And I was super confused when I saw patch 3 since it seemed to be making
> unrelated changes to this struct (and removing an instance of this,
> which should be done in this patch instead). In fact, this patch doesn't
> even build due to this problem. I think some patch massaging is needed
> to fix this all up.
Let me fix it. I plan to send a separate series only with KHO changes
from LUO, so we can expedite its landing.
Pasha
>
> >
> > /**
> > * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
> [...]
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-10-06 17:21 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0tt0cnevi.fsf@kernel.org>
On Mon, Oct 6, 2025 at 1:01 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > The KHO framework uses a notifier chain as the mechanism for clients to
> > participate in the finalization process. While this works for a single,
> > central state machine, it is too restrictive for kernel-internal
> > components like pstore/reserve_mem or IMA. These components need a
> > simpler, direct way to register their state for preservation (e.g.,
> > during their initcall) without being part of a complex,
> > shutdown-time notifier sequence. The notifier model forces all
> > participants into a single finalization flow and makes direct
> > preservation from an arbitrary context difficult.
> > This patch refactors the client participation model by removing the
> > notifier chain and introducing a direct API for managing FDT subtrees.
> >
> > The core kho_finalize() and kho_abort() state machine remains, but
> > clients now register their data with KHO beforehand.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index e23e16618e9b..c4b2d4e4c715 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
> > #define MEMBLOCK_KHO_FDT "memblock"
> > #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
> > #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> > -static struct page *kho_fdt;
> > -
> > -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> > -{
> > - int err = 0, i;
> > -
> > - for (i = 0; i < reserved_mem_count; i++) {
> > - struct reserve_mem_table *map = &reserved_mem_table[i];
> > - struct page *page = phys_to_page(map->start);
> > - unsigned int nr_pages = map->size >> PAGE_SHIFT;
> > -
> > - err |= kho_preserve_pages(page, nr_pages);
> > - }
> > -
> > - err |= kho_preserve_folio(page_folio(kho_fdt));
> > - err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> > -
> > - return notifier_from_errno(err);
> > -}
> > -
> > -static int reserve_mem_kho_notifier(struct notifier_block *self,
> > - unsigned long cmd, void *v)
> > -{
> > - switch (cmd) {
> > - case KEXEC_KHO_FINALIZE:
> > - return reserve_mem_kho_finalize((struct kho_serialization *)v);
> > - case KEXEC_KHO_ABORT:
> > - return NOTIFY_DONE;
> > - default:
> > - return NOTIFY_BAD;
> > - }
> > -}
> > -
> > -static struct notifier_block reserve_mem_kho_nb = {
> > - .notifier_call = reserve_mem_kho_notifier,
> > -};
> >
> > static int __init prepare_kho_fdt(void)
> > {
> > int err = 0, i;
> > + struct page *fdt_page;
> > void *fdt;
> >
> > - kho_fdt = alloc_page(GFP_KERNEL);
> > - if (!kho_fdt)
> > + fdt_page = alloc_page(GFP_KERNEL);
> > + if (!fdt_page)
> > return -ENOMEM;
> >
> > - fdt = page_to_virt(kho_fdt);
> > + fdt = page_to_virt(fdt_page);
> >
> > err |= fdt_create(fdt, PAGE_SIZE);
> > err |= fdt_finish_reservemap(fdt);
> > @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
> > err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
> > for (i = 0; i < reserved_mem_count; i++) {
> > struct reserve_mem_table *map = &reserved_mem_table[i];
> > + struct page *page = phys_to_page(map->start);
> > + unsigned int nr_pages = map->size >> PAGE_SHIFT;
> >
> > + err |= kho_preserve_pages(page, nr_pages);
> > err |= fdt_begin_node(fdt, map->name);
> > err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
> > err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> > @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
> > err |= fdt_end_node(fdt);
> > }
> > err |= fdt_end_node(fdt);
> > -
> > err |= fdt_finish(fdt);
> >
> > + err |= kho_preserve_folio(page_folio(fdt_page));
> > + err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> > +
> > if (err) {
> > pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> > - put_page(kho_fdt);
> > - kho_fdt = NULL;
> > + put_page(fdt_page);
>
> This adds subtree to KHO even if the FDT might be invalid. And then
> leaves a dangling reference in KHO to the FDT in case of an error. I
> think you should either do this check after
> kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
> kho_add_subtree(), or call kho_remove_subtree() in the error block.
I agree. I do not like this err |= pattern; we should be checking
errors cleanly and doing proper clean-ups.
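
A small userspace illustration of one reason the pattern is fragile: OR-ing two negative errno values produces a third, unrelated value. With the Linux values ENOMEM = 12, EINVAL = 22 and ENOENT = 2, (-ENOMEM | -EINVAL) comes out as -ENOENT. The helper names below are invented for the example.

```c
#include <errno.h>

/* Accumulate two results the way the "err |= ..." pattern does. */
int combine_with_or(int a, int b)
{
	int err = 0;

	err |= a;
	err |= b;
	return err;
}

/*
 * Report the first failure instead. Both results were already
 * computed by the caller, so this alone does not stop later steps
 * from running after an earlier one failed.
 */
int combine_first_error(int a, int b)
{
	return a ? a : b;
}
```

With a single failure the OR form happens to return the right code, which is probably why the problem is easy to miss.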
> I prefer the former since if kho_add_subtree() is the one that fails,
> there is little sense in removing a subtree that was never added.
>
> > }
> >
> > return err;
> > @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
> > err = prepare_kho_fdt();
> > if (err)
> > return err;
> > -
> > - err = register_kho_notifier(&reserve_mem_kho_nb);
> > - if (err) {
> > - put_page(kho_fdt);
> > - kho_fdt = NULL;
> > - }
> > -
> > return err;
> > }
> > late_initcall(reserve_mem_init);
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-06 17:01 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-4-pasha.tatashin@soleen.com>
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> The KHO framework uses a notifier chain as the mechanism for clients to
> participate in the finalization process. While this works for a single,
> central state machine, it is too restrictive for kernel-internal
> components like pstore/reserve_mem or IMA. These components need a
> simpler, direct way to register their state for preservation (e.g.,
> during their initcall) without being part of a complex,
> shutdown-time notifier sequence. The notifier model forces all
> participants into a single finalization flow and makes direct
> preservation from an arbitrary context difficult.
> This patch refactors the client participation model by removing the
> notifier chain and introducing a direct API for managing FDT subtrees.
>
> The core kho_finalize() and kho_abort() state machine remains, but
> clients now register their data with KHO beforehand.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e23e16618e9b..c4b2d4e4c715 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
> #define MEMBLOCK_KHO_FDT "memblock"
> #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
> #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
> -static struct page *kho_fdt;
> -
> -static int reserve_mem_kho_finalize(struct kho_serialization *ser)
> -{
> - int err = 0, i;
> -
> - for (i = 0; i < reserved_mem_count; i++) {
> - struct reserve_mem_table *map = &reserved_mem_table[i];
> - struct page *page = phys_to_page(map->start);
> - unsigned int nr_pages = map->size >> PAGE_SHIFT;
> -
> - err |= kho_preserve_pages(page, nr_pages);
> - }
> -
> - err |= kho_preserve_folio(page_folio(kho_fdt));
> - err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
> -
> - return notifier_from_errno(err);
> -}
> -
> -static int reserve_mem_kho_notifier(struct notifier_block *self,
> - unsigned long cmd, void *v)
> -{
> - switch (cmd) {
> - case KEXEC_KHO_FINALIZE:
> - return reserve_mem_kho_finalize((struct kho_serialization *)v);
> - case KEXEC_KHO_ABORT:
> - return NOTIFY_DONE;
> - default:
> - return NOTIFY_BAD;
> - }
> -}
> -
> -static struct notifier_block reserve_mem_kho_nb = {
> - .notifier_call = reserve_mem_kho_notifier,
> -};
>
> static int __init prepare_kho_fdt(void)
> {
> int err = 0, i;
> + struct page *fdt_page;
> void *fdt;
>
> - kho_fdt = alloc_page(GFP_KERNEL);
> - if (!kho_fdt)
> + fdt_page = alloc_page(GFP_KERNEL);
> + if (!fdt_page)
> return -ENOMEM;
>
> - fdt = page_to_virt(kho_fdt);
> + fdt = page_to_virt(fdt_page);
>
> err |= fdt_create(fdt, PAGE_SIZE);
> err |= fdt_finish_reservemap(fdt);
> @@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
> err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
> for (i = 0; i < reserved_mem_count; i++) {
> struct reserve_mem_table *map = &reserved_mem_table[i];
> + struct page *page = phys_to_page(map->start);
> + unsigned int nr_pages = map->size >> PAGE_SHIFT;
>
> + err |= kho_preserve_pages(page, nr_pages);
> err |= fdt_begin_node(fdt, map->name);
> err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
> err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
> @@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
> err |= fdt_end_node(fdt);
> }
> err |= fdt_end_node(fdt);
> -
> err |= fdt_finish(fdt);
>
> + err |= kho_preserve_folio(page_folio(fdt_page));
> + err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
> +
> if (err) {
> pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
> - put_page(kho_fdt);
> - kho_fdt = NULL;
> + put_page(fdt_page);
This adds subtree to KHO even if the FDT might be invalid. And then
leaves a dangling reference in KHO to the FDT in case of an error. I
think you should either do this check after
kho_preserve_folio(page_folio(fdt_page)) and do a clean error check for
kho_add_subtree(), or call kho_remove_subtree() in the error block.
I prefer the former since if kho_add_subtree() is the one that fails,
there is little sense in removing a subtree that was never added.
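
To make the ordering concrete, here is a userspace toy of the flow I have in mind. Every name in it is a hypothetical stand-in (toy_build_fdt() for the fdt_*() calls, toy_preserve_folio() for kho_preserve_folio(), toy_add_subtree() for kho_add_subtree()), and the "unpreserve" step only models undoing the preservation. It makes no claim about which helper the real code would call for that.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy bookkeeping; fail_step selects which step fails (0 = none). */
struct toy_kho {
	bool page_held;
	bool folio_preserved;
	bool subtree_added;
	int fail_step;	/* 1 = FDT build, 2 = preserve, 3 = add subtree */
};

static int toy_build_fdt(struct toy_kho *k)
{
	return k->fail_step == 1 ? -EINVAL : 0;
}

static int toy_preserve_folio(struct toy_kho *k)
{
	if (k->fail_step == 2)
		return -ENOMEM;
	k->folio_preserved = true;
	return 0;
}

static int toy_add_subtree(struct toy_kho *k)
{
	if (k->fail_step == 3)
		return -EEXIST;
	k->subtree_added = true;
	return 0;
}

int toy_prepare_fdt(struct toy_kho *k)
{
	int err;

	k->page_held = true;		/* models alloc_page() */

	err = toy_build_fdt(k);		/* check before touching KHO */
	if (err)
		goto err_free;

	err = toy_preserve_folio(k);
	if (err)
		goto err_free;

	err = toy_add_subtree(k);	/* last step, checked on its own */
	if (err)
		goto err_unpreserve;

	return 0;

err_unpreserve:
	k->folio_preserved = false;	/* models undoing the preservation */
err_free:
	k->page_held = false;		/* models put_page() */
	return err;
}
```

Because the subtree is added last, no error path ever has to remove a subtree, and a failed run leaves nothing registered.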
> }
>
> return err;
> @@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
> err = prepare_kho_fdt();
> if (err)
> return err;
> -
> - err = register_kho_notifier(&reserve_mem_kho_nb);
> - if (err) {
> - put_page(kho_fdt);
> - kho_fdt = NULL;
> - }
> -
> return err;
> }
> late_initcall(reserve_mem_init);
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pratyush Yadav @ 2025-10-06 16:55 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-3-pasha.tatashin@soleen.com>
Hi Pasha,
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> Currently, KHO is controlled via debugfs interface, but once LUO is
> introduced, it can control KHO, and the debug interface becomes
> optional.
>
> Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> the debugfs interface, and allows to inspect the tree.
>
> Move all debugfs related code to a new file to keep the .c files
> clear of ifdefs.
>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]
> @@ -662,36 +660,24 @@ static void __init kho_reserve_scratch(void)
> kho_enable = false;
> }
>
> -struct fdt_debugfs {
> - struct list_head list;
> - struct debugfs_blob_wrapper wrapper;
> - struct dentry *file;
> +struct kho_out {
> + struct blocking_notifier_head chain_head;
> + struct mutex lock; /* protects KHO FDT finalization */
> + struct kho_serialization ser;
> + bool finalized;
> + struct kho_debugfs dbg;
> };
>
> -static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
> - const char *name, const void *fdt)
> -{
> - struct fdt_debugfs *f;
> - struct dentry *file;
> -
> - f = kmalloc(sizeof(*f), GFP_KERNEL);
> - if (!f)
> - return -ENOMEM;
> -
> - f->wrapper.data = (void *)fdt;
> - f->wrapper.size = fdt_totalsize(fdt);
> -
> - file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
> - if (IS_ERR(file)) {
> - kfree(f);
> - return PTR_ERR(file);
> - }
> -
> - f->file = file;
> - list_add(&f->list, list);
> -
> - return 0;
> -}
> +static struct kho_out kho_out = {
> + .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
> + .lock = __MUTEX_INITIALIZER(kho_out.lock),
> + .ser = {
> + .track = {
> + .orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
> + },
> + },
> + .finalized = false,
> +};
There is already one definition for struct kho_out and a static struct
kho_out early in the file. This is a second declaration and definition.
And I was super confused when I saw patch 3 since it seemed to be making
unrelated changes to this struct (and removing an instance of this,
which should be done in this patch instead). In fact, this patch doesn't
even build due to this problem. I think some patch massaging is needed
to fix this all up.
>
> /**
> * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-06 16:38 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <mafs0bjmkp0gb.fsf@kernel.org>
Hi,
On Mon, Oct 06 2025, Pratyush Yadav wrote:
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>
>> The KHO framework uses a notifier chain as the mechanism for clients to
>> participate in the finalization process. While this works for a single,
>> central state machine, it is too restrictive for kernel-internal
>> components like pstore/reserve_mem or IMA. These components need a
>> simpler, direct way to register their state for preservation (e.g.,
>> during their initcall) without being part of a complex,
>> shutdown-time notifier sequence. The notifier model forces all
>> participants into a single finalization flow and makes direct
>> preservation from an arbitrary context difficult.
>> This patch refactors the client participation model by removing the
>> notifier chain and introducing a direct API for managing FDT subtrees.
>>
>> The core kho_finalize() and kho_abort() state machine remains, but
>> clients now register their data with KHO beforehand.
>>
>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> This patch breaks build of test_kho.c (under CONFIG_TEST_KEXEC_HANDOVER):
>
> lib/test_kho.c:49:14: error: ‘KEXEC_KHO_ABORT’ undeclared (first use in this function)
> 49 | case KEXEC_KHO_ABORT:
> | ^~~~~~~~~~~~~~~
> [...]
> lib/test_kho.c:51:14: error: ‘KEXEC_KHO_FINALIZE’ undeclared (first use in this function)
> 51 | case KEXEC_KHO_FINALIZE:
> | ^~~~~~~~~~~~~~~~~~
> [...]
>
> I think you need to update it as well to drop notifier usage.
Here's the fix. Build passes now and the test succeeds under my qemu
test setup.
--- 8< ---
From a8e6b5dfef38bfbcd41f3dd08598cb79a0701d7e Mon Sep 17 00:00:00 2001
From: Pratyush Yadav <pratyush@kernel.org>
Date: Mon, 6 Oct 2025 18:35:20 +0200
Subject: [PATCH] fixup! kho: drop notifiers
Update KHO test to drop the notifiers as well.
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
lib/test_kho.c | 32 +++-----------------------------
1 file changed, 3 insertions(+), 29 deletions(-)
diff --git a/lib/test_kho.c b/lib/test_kho.c
index fe8504e3407b5..e9462a1e4b93b 100644
--- a/lib/test_kho.c
+++ b/lib/test_kho.c
@@ -38,33 +38,6 @@ struct kho_test_state {
 static struct kho_test_state kho_test_state;
-static int kho_test_notifier(struct notifier_block *self, unsigned long cmd,
-			     void *v)
-{
-	struct kho_test_state *state = &kho_test_state;
-	struct kho_serialization *ser = v;
-	int err = 0;
-
-	switch (cmd) {
-	case KEXEC_KHO_ABORT:
-		return NOTIFY_DONE;
-	case KEXEC_KHO_FINALIZE:
-		/* Handled below */
-		break;
-	default:
-		return NOTIFY_BAD;
-	}
-
-	err |= kho_preserve_folio(state->fdt);
-	err |= kho_add_subtree(ser, KHO_TEST_FDT, folio_address(state->fdt));
-
-	return err ? NOTIFY_BAD : NOTIFY_DONE;
-}
-
-static struct notifier_block kho_test_nb = {
-	.notifier_call = kho_test_notifier,
-};
-
 static int kho_test_save_data(struct kho_test_state *state, void *fdt)
 {
 	phys_addr_t *folios_info;
@@ -111,6 +84,7 @@ static int kho_test_prepare_fdt(struct kho_test_state *state)
 	fdt = folio_address(state->fdt);
+	err |= kho_preserve_folio(state->fdt);
 	err |= fdt_create(fdt, fdt_size);
 	err |= fdt_finish_reservemap(fdt);
@@ -194,7 +168,7 @@ static int kho_test_save(void)
 	if (err)
 		goto err_free_folios;
-	err = register_kho_notifier(&kho_test_nb);
+	err = kho_add_subtree(KHO_TEST_FDT, folio_address(state->fdt));
 	if (err)
 		goto err_free_fdt;
@@ -309,7 +283,7 @@ static void kho_test_cleanup(void)
 static void __exit kho_test_exit(void)
 {
-	unregister_kho_notifier(&kho_test_nb);
+	kho_remove_subtree(folio_address(kho_test_state.fdt));
 	kho_test_cleanup();
 }
 module_exit(kho_test_exit);
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 02/30] kho: make debugfs interface optional
From: Pratyush Yadav @ 2025-10-06 16:30 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-3-pasha.tatashin@soleen.com>
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> Currently, KHO is controlled via debugfs interface, but once LUO is
> introduced, it can control KHO, and the debug interface becomes
> optional.
>
> Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> the debugfs interface, and allows to inspect the tree.
>
> Move all debugfs related code to a new file to keep the .c files
> clear of ifdefs.
>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
> MAINTAINERS | 3 +-
> kernel/Kconfig.kexec | 10 ++
> kernel/Makefile | 1 +
> kernel/kexec_handover.c | 255 +++++--------------------------
> kernel/kexec_handover_debug.c | 218 ++++++++++++++++++++++++++
> kernel/kexec_handover_internal.h | 44 ++++++
> 6 files changed, 311 insertions(+), 220 deletions(-)
> create mode 100644 kernel/kexec_handover_debug.c
> create mode 100644 kernel/kexec_handover_internal.h
>
[...]
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -109,6 +109,16 @@ config KEXEC_HANDOVER
> to keep data or state alive across the kexec. For this to work,
> both source and target kernels need to have this option enabled.
>
> +config KEXEC_HANDOVER_DEBUG
Nit: can we call it KEXEC_HANDOVER_DEBUGFS instead? I think we would
like to add a KEXEC_HANDOVER_DEBUG at some point to control debug
asserts for KHO, and the naming would get confusing. And renaming config
symbols is kind of a pain.
> + bool "kexec handover debug interface"
> + depends on KEXEC_HANDOVER
> + depends on DEBUG_FS
> + help
> + Allow to control kexec handover device tree via debugfs
> + interface, i.e. finalize the state or aborting the finalization.
> + Also, enables inspecting the KHO fdt trees with the debugfs binary
> + blobs.
> +
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-10-06 16:17 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0bjmkp0gb.fsf@kernel.org>
On Mon, Oct 6, 2025 at 10:30 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > The KHO framework uses a notifier chain as the mechanism for clients to
> > participate in the finalization process. While this works for a single,
> > central state machine, it is too restrictive for kernel-internal
> > components like pstore/reserve_mem or IMA. These components need a
> > simpler, direct way to register their state for preservation (e.g.,
> > during their initcall) without being part of a complex,
> > shutdown-time notifier sequence. The notifier model forces all
> > participants into a single finalization flow and makes direct
> > preservation from an arbitrary context difficult.
> > This patch refactors the client participation model by removing the
> > notifier chain and introducing a direct API for managing FDT subtrees.
> >
> > The core kho_finalize() and kho_abort() state machine remains, but
> > clients now register their data with KHO beforehand.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> This patch breaks build of test_kho.c (under CONFIG_TEST_KEXEC_HANDOVER):
>
> lib/test_kho.c:49:14: error: ‘KEXEC_KHO_ABORT’ undeclared (first use in this function)
> 49 | case KEXEC_KHO_ABORT:
> | ^~~~~~~~~~~~~~~
> [...]
> lib/test_kho.c:51:14: error: ‘KEXEC_KHO_FINALIZE’ undeclared (first use in this function)
> 51 | case KEXEC_KHO_FINALIZE:
> | ^~~~~~~~~~~~~~~~~~
> [...]
>
> I think you need to update it as well to drop notifier usage.
Yes, thank you Pratyush. I missed this change in my patch.
Pasha
* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Jan Kara @ 2025-10-06 15:39 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrey Albershteyn, Amir Goldstein, Arnd Bergmann,
Casey Schaufler, Christian Brauner, Jan Kara, Pali Rohár,
Paul Moore, linux-api, linux-fsdevel, linux-kernel, linux-xfs,
selinux, Andrey Albershteyn
In-Reply-To: <a622643f-1585-40b0-9441-cf7ece176e83@kernel.org>
On Mon 06-10-25 13:09:05, Jiri Slaby wrote:
> On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> > Future patches will add new syscalls which use these functions. As
> > this interface won't be used for ioctls only, the EOPNOSUPP is more
> > appropriate return code.
> >
> > This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> > vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> > EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
> >
> > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> ...
> > @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> > fileattr_fill_flags(&fa, flags);
> > err = vfs_fileattr_set(idmap, dentry, &fa);
> > mnt_drop_write_file(file);
> > + if (err == -EOPNOTSUPP)
> > + err = -ENOIOCTLCMD;
>
> This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
> ENOIOCTLCMD/ENOTTY:
> https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
>
> I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used to
> return EOPNOTSUPP.
>
> This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
> &FS_NODUMP_FL):
> https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
>
> dumps in 6.16:
> sf: ioctl: Operation not supported
>
> with the above patch:
> sf: ioctl: Inappropriate ioctl for device
>
> Is this expected?
No, that's a bug and a clear userspace regression so we need to fix it. I
think we need to revert this commit and instead convert ENOIOCTLCMD from
vfs_fileattr_get/set() to EOPNOTSUPP in appropriate places. Andrey?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v4 03/30] kho: drop notifiers
From: Pratyush Yadav @ 2025-10-06 14:30 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-4-pasha.tatashin@soleen.com>
Hi Pasha,
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> The KHO framework uses a notifier chain as the mechanism for clients to
> participate in the finalization process. While this works for a single,
> central state machine, it is too restrictive for kernel-internal
> components like pstore/reserve_mem or IMA. These components need a
> simpler, direct way to register their state for preservation (e.g.,
> during their initcall) without being part of a complex,
> shutdown-time notifier sequence. The notifier model forces all
> participants into a single finalization flow and makes direct
> preservation from an arbitrary context difficult.
> This patch refactors the client participation model by removing the
> notifier chain and introducing a direct API for managing FDT subtrees.
>
> The core kho_finalize() and kho_abort() state machine remains, but
> clients now register their data with KHO beforehand.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
This patch breaks build of test_kho.c (under CONFIG_TEST_KEXEC_HANDOVER):
lib/test_kho.c:49:14: error: ‘KEXEC_KHO_ABORT’ undeclared (first use in this function)
49 | case KEXEC_KHO_ABORT:
| ^~~~~~~~~~~~~~~
[...]
lib/test_kho.c:51:14: error: ‘KEXEC_KHO_FINALIZE’ undeclared (first use in this function)
51 | case KEXEC_KHO_FINALIZE:
| ^~~~~~~~~~~~~~~~~~
[...]
I think you need to update it as well to drop notifier usage.
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Arnd Bergmann @ 2025-10-06 11:43 UTC (permalink / raw)
To: Jiri Slaby, Andrey Albershteyn, Amir Goldstein, Casey Schaufler,
Christian Brauner, Jan Kara, Pali Rohár, Paul Moore
Cc: linux-api, linux-fsdevel, linux-kernel, linux-xfs, selinux,
Andrey Albershteyn
In-Reply-To: <a622643f-1585-40b0-9441-cf7ece176e83@kernel.org>
On Mon, Oct 6, 2025, at 13:09, Jiri Slaby wrote:
> On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
>> Future patches will add new syscalls which use these functions. As
>> this interface won't be used for ioctls only, the EOPNOSUPP is more
>> appropriate return code.
>>
>> This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
>> vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
>> EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
>>
...
> dumps in 6.16:
> sf: ioctl: Operation not supported
>
> with the above patch:
> sf: ioctl: Inappropriate ioctl for device
>
>
> Is this expected?
This does look like an unintentional bug: As far as I can see, the
-ENOIOCTLCMD was previously used to indicate that a particular filesystem
does not have a fileattr_{get,set} callback at all, while individual
filesystems used EOPNOSUPP to indicate that a particular attribute
flag is unsupported. With the double conversion, both error codes
get turned into a single one.
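The collapse described above can be modeled in a few lines of userspace C. This is only a sketch: the model_* functions and the fs_case enum are hypothetical stand-ins rather than the kernel's code, ENOIOCTLCMD is defined by hand because it is kernel-internal (515 in include/linux/errno.h), and the final ENOIOCTLCMD-to-ENOTTY step is simplified to a single mapping that matches the "Inappropriate ioctl for device" output reported in this thread.

```c
#include <assert.h>
#include <errno.h>

/* ENOIOCTLCMD is kernel-internal, so userspace <errno.h> lacks it. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/* Hypothetical cases a filesystem can hit when asked to set a flag. */
enum fs_case {
	FS_CASE_OK,               /* flag accepted */
	FS_CASE_NO_CALLBACK,      /* no fileattr_set callback at all */
	FS_CASE_FLAG_UNSUPPORTED, /* callback exists, flag rejected */
};

/* Model of vfs_fileattr_set() before the patch. */
static int model_vfs_set_old(enum fs_case c)
{
	switch (c) {
	case FS_CASE_NO_CALLBACK:
		return -ENOIOCTLCMD;
	case FS_CASE_FLAG_UNSUPPORTED:
		return -EOPNOTSUPP;
	default:
		return 0;
	}
}

/* Model after the patch: a missing callback also yields -EOPNOTSUPP. */
static int model_vfs_set_new(enum fs_case c)
{
	int err = model_vfs_set_old(c);

	return err == -ENOIOCTLCMD ? -EOPNOTSUPP : err;
}

/* Simplified ioctl exit path: -ENOIOCTLCMD reaches userspace as ENOTTY. */
static int model_to_user(int err)
{
	return err == -ENOIOCTLCMD ? -ENOTTY : err;
}

/* ioctl_setflags() as in 6.16: the VFS result is passed through. */
static int model_setflags_old(enum fs_case c)
{
	return model_to_user(model_vfs_set_old(c));
}

/* ioctl_setflags() with the patch: every -EOPNOTSUPP is translated back. */
static int model_setflags_new(enum fs_case c)
{
	int err = model_vfs_set_new(c);

	if (err == -EOPNOTSUPP)
		err = -ENOIOCTLCMD;
	return model_to_user(err);
}

/* Exercise the model: with the patch both failure cases look the same. */
static void model_selfcheck(void)
{
	assert(model_setflags_old(FS_CASE_FLAG_UNSUPPORTED) == -EOPNOTSUPP);
	assert(model_setflags_old(FS_CASE_NO_CALLBACK) == -ENOTTY);
	assert(model_setflags_new(FS_CASE_FLAG_UNSUPPORTED) == -ENOTTY);
	assert(model_setflags_new(FS_CASE_NO_CALLBACK) == -ENOTTY);
}
```

In this model an unsupported flag is reported as EOPNOTSUPP by the old path and as ENOTTY by the new one, which is indistinguishable from a missing callback. Calling model_selfcheck() from a main() runs the four checks.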
Arnd
* Re: [PATCH v6 4/6] fs: make vfs_fileattr_[get|set] return -EOPNOSUPP
From: Jiri Slaby @ 2025-10-06 11:09 UTC (permalink / raw)
To: Andrey Albershteyn, Amir Goldstein, Arnd Bergmann,
Casey Schaufler, Christian Brauner, Jan Kara, Pali Rohár,
Paul Moore
Cc: linux-api, linux-fsdevel, linux-kernel, linux-xfs, selinux,
Andrey Albershteyn
In-Reply-To: <20250630-xattrat-syscall-v6-4-c4e3bc35227b@kernel.org>
On 30. 06. 25, 18:20, Andrey Albershteyn wrote:
> Future patches will add new syscalls which use these functions. As
> this interface won't be used for ioctls only, the EOPNOSUPP is more
> appropriate return code.
>
> This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
> vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
> EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.
>
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
...
> @@ -292,6 +294,8 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
> fileattr_fill_flags(&fa, flags);
> err = vfs_fileattr_set(idmap, dentry, &fa);
> mnt_drop_write_file(file);
> + if (err == -EOPNOTSUPP)
> + err = -ENOIOCTLCMD;
This breaks borg code (unit tests already) as it expects EOPNOTSUPP, not
ENOIOCTLCMD/ENOTTY:
https://github.com/borgbackup/borg/blob/1c6ef7a200c7f72f8d1204d727fea32168616ceb/src/borg/platform/linux.pyx#L147
I.e. setflags now returns ENOIOCTLCMD/ENOTTY for cases where 6.16 used
to return EOPNOTSUPP.
This minimal testcase program doing ioctl(fd2, FS_IOC_SETFLAGS,
&FS_NODUMP_FL):
https://github.com/jirislaby/collected_sources/tree/master/ioctl_setflags
dumps in 6.16:
sf: ioctl: Operation not supported
with the above patch:
sf: ioctl: Inappropriate ioctl for device
Is this expected?
thanks,
--
js
suse labs
* Re: [PATCH 00/62] initrd: remove classic initrd support
From: Askar Safin @ 2025-10-06 6:19 UTC (permalink / raw)
To: rob
Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs, jack,
julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
linux-arch, linux-block, linux-csky, linux-doc, linux-efi,
linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
loongarch, mcgrof, mingo, monstr, mzxreary, patches, sparclinux,
thomas.weissschuh, thorsten.blum, torvalds, tytso, viro, x86
In-Reply-To: <0342fbda-9901-4293-afa7-ba6085eb1688@landley.net>
Rob Landley <rob@landley.net>:
> Still useful for embedded systems that can memory map flash, but it's
They can use the workaround suggested in the cover letter.
> While you're at it, could you fix static/builtin initramfs so PID 1 has
> a valid stdin/stdout/stderr?
This is on my low-priority TODO list. I want to help you. I will possibly do this
in a month or two or three...
> I posted various patches to make CONFIG_DEVTMPFS_MOUNT work for initmpfs
My solution will be different: I will create static /dev/console and /dev/null
after unpacking the builtin and external initramfs. (/dev/null because of
the bionic problem you wrote about somewhere.)
> Oh hey, somebody using mkroot. Cool. :)
Yeah, thank you for mkroot.
> Now that lkml.iu.edu is back up (yay!) all the links in
> ramfs-rootfs-initramfs.txt can theoretically be fixed just by switching
> the domain name.
Yes, I plan to replace them with lore.kernel.org ones. This is on my low-priority
TODO list, too.
> > For example, I renamed the following global variables:
> >
> > __initramfs_start
> > __initramfs_size
>
> That already said initramfs, and you renamed it.
Yes, to distinguish builtin and external initramfs.
> > phys_initrd_start
> > phys_initrd_size
> > initrd_start
> > initrd_end
>
> Which is data delivered through grub's "initrd" command. Here's how I've
My plan is to change the "official" names for these things.
"initramfs" will refer both to the .cpio archive itself and to the loading
mechanism. The name of GRUB's "initrd" command will become "wrong, kept for
compatibility".
But I plan to do all these renamings after I fully remove initrd support,
which will happen in September 2026, as I explained in another email.
> 3) rootfs is (for some reason) the name of the mounted filesystem in
> /proc/mounts (because letting it say "ramfs" or "tmpfs" like normal in
> /proc/mounts would be consistent and immediately understandable, so they
> couldn't have that).
I totally agree. I want to change it to ramfs/tmpfs. But this change
may break something, so I think we need some strong motivation to
do this. So I will wait for removal of nommu support. Arnd Bergmann said
"NOMMU removal maybe 2027" ( https://lwn.net/Articles/1035727/ ,
https://static.sched.com/hosted_files/osseu2025/75/32-bit%20Linux%20in%202025%20%28OSS%20Europe%29.pdf ,
slide 20). (Also he said 32-bit support will be removed, too.)
After that I will remove ramfs (yeah, I love to remove things),
and, while we are here, I will rename "rootfs" to "tmpfs" in
/proc/mounts (hopefully I will get away with this).
> > __builtin_initramfs_start
> > __builtin_initramfs_size
> > phys_external_initramfs_start
> > phys_external_initramfs_size
> > virt_external_initramfs_start
> > virt_external_initramfs_end
>
> Do you believe people will understand what the slightly longer names are
> without looking them up?
No. But I still hope the new names are better. As I said above, all of these
will be named "initramfs" under my new plan. But again, all of this
will happen after the full initrd removal, which will happen in Sep 2026.
> I'm all for removing obsolete code, but a partial cleanup that still
> leaves various sharp edges around isn't necessarily a net improvement.
> Did you remove the NFS mount code from init/do_mounts.c? Part of the
Okay, I have put this on my low-priority TODO list.
> The one config symbol that really seems to bite people in this area is
> BLK_DEV_INITRD because a common thing people running from initramfs want
> to do is yank the block layer entirely (CONFIG_BLOCK=n) and use
> initramfs instead, and needing to enable CONFIG_BLK_DEV_INITRD while
>
> And the INSANE part is they generally want a static initrd to do it so
> they're not using the external loader, but Kconfig has INITRAMFS_SOURCE
> under CONFIG_BLK_DEV_INITRD and it's a mess. Renaming THAT symbol would
> be good.
You mean renaming CONFIG_BLK_DEV_INITRD would be good?
I do exactly that.
While we are here, I also rename CONFIG_RD_*,
because configs will be broken anyway.
Also, we recently got the "transitional" keyword to help with such
renamings: https://www.phoronix.com/news/Linux-6.18-Transitional .
I will use it.
> To you. I'm not entirely sure what virt_external means. (Yes I could go
It means "virtual address of external initramfs". But, yes, Borislav Petkov
told me in another email that kernel devs usually use "va" for virtual
address and "pa" for physical, so I will use these terms (in Sep 2026).
> Meanwhile 35 years of installed base expertise in other people's heads
> has been discarded and developed version skew for anyone maintaining an
I'm still not convinced. Ideally I want to remove the word "initrd" from the Linux
sources completely.
The decision whether to merge my patches is up to the maintainers anyway. They
will decide whether these renamings are a good idea.
> > - Removed kernel command line parameter "ramdisk_start",
> > which was used for initrd only (not for initramfs)
>
> Some bootloaders appended that to the kernel command line to specify
> where in memory they've loaded the initrd image, which could be a
> cpio.gz once upon a time. No idea what regressions happened since though.
I double-checked: in modern kernels ramdisk_start is used only in the initrd
code path, not in the initramfs code path.
"initrd=" is used in both code paths, and I keep it.
==
While we are here, let me answer your other emails, too.
Here is my answer to https://lore.kernel.org/all/94023988-8498-4070-bdb7-6758dbe4b91d@landley.net/ .
> There used to be a way to feed a the kernel config a text file listing
> what to make in the cpio file instead of just pointing it at a
> directory, and my old Aboriginal Linux build used that mechanism
...
> But kernel commit 469e87e89fd6 broke that mechanism because somebody
> dunning-krugered it away ("I don't understand why we need this therefore
I will consider fixing this, too. I have put it on my low-priority TODO list.
But it is possible that I will instead remove gen-init-cpio completely.
(I will do some experiments before deciding.)
If it was broken, and nobody except you cared, then this means that
nobody except you uses it.
Of course, I will do that after sending the patch for unconditional creation of
/dev/console and /dev/null, so you are safe.
> And again: you ONLY need this for static initramfs. Dynamic initramfs
> has code create /dev/console (at boot time, not build time):
>
> https://github.com/torvalds/linux/blob/v6.16/init/noinitramfs.c#L27
Your explanation is wrong here. As you can see in the Makefile, noinitramfs.c
is not built if BLK_DEV_INITRD is enabled.
If BLK_DEV_INITRD is disabled, then noinitramfs.c
is built, and it creates /dev/console.
If BLK_DEV_INITRD is enabled and INITRAMFS_SOURCE is not set, then
the default built-in initramfs is used, which is specified here:
https://elixir.bootlin.com/linux/v6.17/source/usr/default_cpio_list
(and it happens to be equivalent to what noinitramfs.c creates).
If both BLK_DEV_INITRD and INITRAMFS_SOURCE are set, then
INITRAMFS_SOURCE is used instead of the default built-in initramfs,
so there is no /dev/console.
I am totally sure that my explanation is correct.
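The three cases can be condensed into a small decision function. This is a sketch of the explanation in this email, not kernel code; dev_console_origin() and the console_origin enum are hypothetical names, and the two parameters simply mirror whether BLK_DEV_INITRD is enabled and whether INITRAMFS_SOURCE is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Where /dev/console comes from at boot, per the three cases above. */
enum console_origin {
	CONSOLE_FROM_NOINITRAMFS,  /* init/noinitramfs.c creates it */
	CONSOLE_FROM_DEFAULT_CPIO, /* usr/default_cpio_list provides it */
	CONSOLE_ONLY_IF_IN_SOURCE, /* INITRAMFS_SOURCE has to supply it */
};

/* Hypothetical helper mapping the two Kconfig settings to an origin. */
static enum console_origin dev_console_origin(bool blk_dev_initrd,
					       bool initramfs_source_set)
{
	/* Without BLK_DEV_INITRD, noinitramfs.c is built and runs. */
	if (!blk_dev_initrd)
		return CONSOLE_FROM_NOINITRAMFS;
	/* With it but without a custom source, the default list is used. */
	if (!initramfs_source_set)
		return CONSOLE_FROM_DEFAULT_CPIO;
	/* A custom source replaces the default list entirely. */
	return CONSOLE_ONLY_IF_IN_SOURCE;
}

/* Check the three cases described in the email. */
static void dev_console_selfcheck(void)
{
	assert(dev_console_origin(false, false) == CONSOLE_FROM_NOINITRAMFS);
	assert(dev_console_origin(true, false) == CONSOLE_FROM_DEFAULT_CPIO);
	assert(dev_console_origin(true, true) == CONSOLE_ONLY_IF_IN_SOURCE);
}
```

Only the last case leaves PID 1 without a console unless the custom archive contains the device node itself.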
> I could emit cpio contents with xxd -r from a HERE document hexdump or
There is no need for "xxd -r". The cpio encoding of /dev/console is ASCII
(except for some NUL bytes). See:
$ echo /dev/console | cpio --create --format=newc --quiet | xxd
00000000: 3037 3037 3031 3030 3030 3030 3043 3030 0707010000000C00
00000010: 3030 3231 3830 3030 3030 3030 3030 3030 0021800000000000
00000020: 3030 3030 3030 3030 3030 3030 3031 3638 0000000000000168
00000030: 4438 4337 4241 3030 3030 3030 3030 3030 D8C7BA0000000000
00000040: 3030 3030 3030 3030 3030 3030 3036 3030 0000000000000600
00000050: 3030 3030 3035 3030 3030 3030 3031 3030 0000050000000100
00000060: 3030 3030 3044 3030 3030 3030 3030 2f64 00000D00000000/d
00000070: 6576 2f63 6f6e 736f 6c65 0000 3037 3037 ev/console..0707
00000080: 3031 3030 3030 3030 3030 3030 3030 3030 0100000000000000
00000090: 3030 3030 3030 3030 3030 3030 3030 3030 0000000000000000
000000a0: 3030 3030 3030 3030 3031 3030 3030 3030 0000000001000000
000000b0: 3030 3030 3030 3030 3030 3030 3030 3030 0000000000000000
000000c0: 3030 3030 3030 3030 3030 3030 3030 3030 0000000000000000
000000d0: 3030 3030 3030 3030 3030 3030 3030 3030 0000000000000000
000000e0: 3042 3030 3030 3030 3030 5452 4149 4c45 0B00000000TRAILE
000000f0: 5221 2121 0000 0000 0000 0000 0000 0000 R!!!............
00000100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
So, I think the following will work (not tested):
==
printf '%s' '0707010000000C0000218000000000000000000000000168D8C7BA00000000000000000000000600000005000000010000000D00000000/dev/console' > out.cpio
printf '\0\0' >> out.cpio
==
Maybe even the last '\0\0' is not needed.
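To see what the printf string encodes, the newc header can be decoded field by field: a 6-character magic followed by thirteen 8-digit hexadecimal fields, 110 bytes in total. The sketch below assumes that layout; newc_parse(), newc_name_pad() and the field enum are hypothetical helper names, not an existing API, and the header is spelled out one field per line so that it can be compared with the hexdump.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NEWC_HDR_LEN 110 /* 6-char magic + 13 fields * 8 hex digits */

/* Field order in a newc header, after the magic. */
enum newc_field {
	NEWC_INO, NEWC_MODE, NEWC_UID, NEWC_GID, NEWC_NLINK, NEWC_MTIME,
	NEWC_FILESIZE, NEWC_DEVMAJOR, NEWC_DEVMINOR, NEWC_RDEVMAJOR,
	NEWC_RDEVMINOR, NEWC_NAMESIZE, NEWC_CHECK, NEWC_NFIELDS
};

/* Header and name from the printf command, one field per line. */
static const char console_entry[] =
	"070701"   /* magic */
	"0000000C" /* ino */
	"00002180" /* mode */
	"00000000" /* uid */
	"00000000" /* gid */
	"00000001" /* nlink */
	"68D8C7BA" /* mtime */
	"00000000" /* filesize */
	"00000000" /* devmajor */
	"00000006" /* devminor */
	"00000005" /* rdevmajor */
	"00000001" /* rdevminor */
	"0000000D" /* namesize, includes the terminating NUL */
	"00000000" /* check */
	"/dev/console";

/* Decode the 13 hex fields; returns 0 on success, -1 on a bad header. */
static int newc_parse(const char *buf, size_t len, uint32_t f[NEWC_NFIELDS])
{
	if (len < NEWC_HDR_LEN || memcmp(buf, "070701", 6) != 0)
		return -1;
	for (int i = 0; i < NEWC_NFIELDS; i++) {
		char hex[9];
		char *end;

		memcpy(hex, buf + 6 + 8 * i, 8);
		hex[8] = '\0';
		f[i] = (uint32_t)strtoul(hex, &end, 16);
		if (end != hex + 8)
			return -1;
	}
	return 0;
}

/* Padding bytes after the name so header + name ends on a 4-byte boundary. */
static size_t newc_name_pad(uint32_t namesize)
{
	return (4 - (NEWC_HDR_LEN + namesize) % 4) % 4;
}

/* Check that the entry is a character device 5:1 with mode 0600. */
static void newc_selfcheck(void)
{
	uint32_t f[NEWC_NFIELDS];

	assert(newc_parse(console_entry, strlen(console_entry), f) == 0);
	assert((f[NEWC_MODE] & 0170000) == 0020000); /* S_IFCHR */
	assert((f[NEWC_MODE] & 07777) == 0600);
	assert(f[NEWC_RDEVMAJOR] == 5 && f[NEWC_RDEVMINOR] == 1);
	assert(f[NEWC_NAMESIZE] == strlen("/dev/console") + 1);
	assert(newc_name_pad(f[NEWC_NAMESIZE]) == 1);
}
```

Decoded this way, the name size of 13 already counts the terminating NUL, so the two NUL bytes written by the second printf are that terminator plus one byte of padding. The sketch says nothing about whether the kernel accepts an archive that lacks the TRAILER!!! entry.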
Also, this email of yours ( https://lore.kernel.org/all/94023988-8498-4070-bdb7-6758dbe4b91d@landley.net/ )
for some reason did not end up on https://lore.kernel.org/lkml .
As you can see here https://lore.kernel.org/lkml/94023988-8498-4070-bdb7-6758dbe4b91d@landley.net/ ,
the full list of lore mailing lists that got it is linux-snps-arc, linux-riscv and linux-sh .
I wrote about this to public-inbox:
http://public-inbox.org/meta/CAPnZJGB7ugY5rytS+hO-QzvPQBNjCh1jzs4WVkuakafBM9c_=w@mail.gmail.com/T/#u .
But it is possible that the problem is on your side.
Maybe this is why people ignore your emails? Maybe they simply don't get them?
Consider applying for a linux.dev email address ( https://linux.dev ). They are free for Linux devs.
==
Now let me answer https://lore.kernel.org/lkml/8f595eec-e85e-4c1f-acb0-5069a01c1012@landley.net/T/#u .
> I find the community an elaborate bureaucracy unresponsive to hobbyists.
> Documentation/process/submitting-patches.rst being a 934 line document
> with a bibliography, plus a 24 step checklist not counting the a) b) c)
> subsections are just symptoms. The real problem is following those is
> not sufficient to navigate said bureaucracy.
I totally agree.
Still, I was somehow able to manage this.
Again: I totally agree. I just want to share some practical advice that helped me
to get my patches merged.
As you can see, I was able to get my patches merged:
https://lore.kernel.org/all/?q=f:%22Askar%20Safin%22 .
And this is despite nobody paying me for this. I do this in my own free time.
As far as I understand, you are doing embedded Linux development as your job,
so you are in a better position.
My patches are merged even though my productivity is low. I am a very slow person.
You don't need to remember all of submitting-patches.rst . Just do this:
- Run checkpatch.pl . It accepts git ranges, e. g. "checkpatch.pl origin/HEAD..HEAD"
- After posting patches respond to comments, apply their edits, send new version, then again and again
When sending patches and responding to comments, don't write overly long letters.
Nobody will carefully read long letters and respond to them.
I respond to such letters, because I'm autistic, and I feel a responsibility to carefully
read and respond to each letter. But other people don't do this.
In particular, when sending patches and responding to comments, don't write long
paragraphs about good things you did in the past and about how you are disappointed
in the entire world, such as these:
> Let's see, I wrote the initramfs documentation in 2005:
>
> https://lwn.net/Articles/157676/
>
> Was already correcting kernel developers on how it actually worked
> (rather than theoretically worked) in 2006:
>
> https://lkml.iu.edu/hypermail//linux/kernel/0603.2/2760.html
>
> I added tmpfs support to it in 2013 (because nobody else had bothered
> for EIGHT YEARS):
>
> https://lkml.iu.edu/hypermail/linux/kernel/1306.3/04204.html
>
> I've maintained my own cpio implementation in toybox for over a decade:
>
> https://github.com/landley/toybox/commit/a2d558151a63
>
> The successor to aboriginal (above) is a 400 line bash script that
> builds a dozen archtectures that each boot to a shell prompt in qemu:
>
> https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh
> https://landley.net/bin/mkroot/latest/
>
> With automated regression test infrastructure to boot them all under
> qemu and confirm that it runs, the clocks are set right, the network
> works, and it can read from -hda:
>
> https://github.com/landley/toybox/blob/master/mkroot/testroot.sh
>
> So yes I _can_ create my own bespoke C program to modify the file in
> arbitrary ways, I have my reasons not to do that, and have thought about
> them for a while now.
Again: I'm not trying to insult you. I'm just trying to give advice on how
to get your patches merged.
When my patches are ready, I send them using something like this:
==
UPSTREAM=origin/HEAD
MERGE_BASE="$(git merge-base "$UPSTREAM" HEAD)"
mkdir /tmp/patches
# For --signoff
export GIT_COMMITTER_EMAIL=me@example.com
# Prepare patches
# --base for "base-commit:" footer
git format-patch --cover-letter --find-renames --base="$MERGE_BASE" --signoff -o /tmp/patches \
--subject-prefix='PATCH v2' "$MERGE_BASE"
editor /tmp/patches/0000-cover-letter.patch
# Send
# "--batch-size=1 --relogin-delay=20" to insert delays between patches. Hopefully
# this will help me to cope with my mailserver limits
# "--confirm=" to give myself chance to cancel
git send-email --batch-size=1 --relogin-delay=20 --confirm=always --to=a@example.com --cc=b@example.com \
/tmp/patches
==
This script will automatically generate a nice diffstat in the cover letter.
This script is not tested. Actually I use my own 182-line Rust program, which does
the same thing.
This is the checklist I plan to follow when sending the v2 version of this initrd patchset:
- Read all answers to prev. version, respond and apply edits
- checkpatch.pl
- Check that my patchset doesn't conflict with linux-next
- Check that every commit compiles for x86_64 with "W=1"
- Test everything using mkroot.sh rewritten in Rust
> Why keep the section when you removed the old mechanism?
This section still contains useful info, so I kept it.
But okay, I agree, I will rewrite it to not mention initrd.
I will do this after the full removal of initrd, i.e. in Sep 2026.
If you want me to send a patch for this document _now_,
then just ask me, I will try to do this.
> Those two lines you just touched contradict each other
Will fix in Sep 2026, too.
> The init/noinitramfs.c file does init/mkdir("/dev") and
> init_mknod("/dev/console") because calling the syscall_blah() functions
> directly was considered icky so they created gratuitous wrappers to do
You cannot directly call a syscall from kernel code if the syscall
works with strings. Reasons are here: https://lwn.net/Articles/832121/ .
The mkdir syscall expects a string located in user memory, so you
cannot call it from the kernel and pass a kernel string to it.
Thus you need a separate init_mkdir.
> Anyway, that's why the 130+ byte archive was there. It wasn't actually
> empty, even when initramfs was disabled.
I just double-checked. If BLK_DEV_INITRD is disabled, then
there is no builtin initramfs at all. If BLK_DEV_INITRD is
disabled, then initramfs_data.S is not built, as we can see here:
https://elixir.bootlin.com/linux/v6.17/source/usr/Makefile#L15
And initramfs_data.S contains the symbol __initramfs_size, so, yes,
initramfs_data.S is the actual builtin initramfs.
In fact, that "obj-$(CONFIG_BLK_DEV_INITRD) :=" trick
is not needed, because the whole usr/ dir is compiled out
if there is no BLK_DEV_INITRD:
https://elixir.bootlin.com/linux/v6.17/source/init/Kconfig#L1455
Again: I acknowledge the bug with the missing /dev/console. In fact,
I was able to reproduce it. I plan to fix it in a month or two.
> > +If the kernel has CONFIG_BLK_DEV_INITRD enabled, an external cpio.gz archive can also
>
> You renamed that symbol, then even you use the old name here.
I rename it in a later commit.
> > -This has the memory efficiency advantages of initramfs (no ramdisk block
> > -device) but the separate packaging of initrd (which is nice if you have
> > +This is nice if you have
> > non-GPL code you'd like to run from initramfs, without conflating it with
> > -the GPL licensed Linux kernel binary).
> > +the GPL licensed Linux kernel binary.
>
> IANAL: Whether or not this qualifies as "mere aggregation" had yet to go
> to court last I heard.
It is possible that a court will use this file as an argument.
So let's keep this paragraph here. :)
There is an example where the FAQ on the FSF site was actually
used as an argument in court: https://www.sonarsource.com/blog/will-the-new-judicial-ruling-in-the-vizio-lawsuit-strengthen-the-gpl/ .
I mean this quote:
> Vizio “did not dispute” the first two questions, focusing instead on the “expectations” of the contracting parties.
> Relying on the Free Software Foundation’s (FSF) GPL FAQs, it argued that the FSF never intended for third parties to enforce the contract,
> and therefore the parties to the contract could not have intended it.
> > echo init | cpio -o -H newc | gzip > test.cpio.gz
> > - # Testing external initramfs using the initrd loading mechanism.
> > + # Testing external initramfs.
>
> Does grub not still call it "initrd"?
Yes, grub still calls it "initrd".
As I said, in Sep 2026 I will rename the bootloader loading mechanism to "initramfs",
and the name of the grub command "initrd" will simply become "wrong".
> A) they added -hda so you don't have to give it a dummy /dev/zero anymore.
Ok, I will fix.
> B) there's no longer a "qemu" defaulting to the current architecture,
Ok, I will fix.
--
Askar Safin
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Theodore Ts'o @ 2025-10-06 2:16 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <aOMBbKUlvv2uYLzD@dread.disaster.area>
On Mon, Oct 06, 2025 at 10:38:20AM +1100, Dave Chinner wrote:
> We have already provided a safe method for minimising the overhead
> of c/mtime updates in the IO path - it's called lazytime. The
> lazytime mount option provides eventual consistency for c/mtime
> updates for IO operations instead of immediate consistency.
>
> Timestamps are still updated to have the correct values, but the
> latency/performance of the timestamp updates is greatly improved by
> holding them purely in memory until some other trigger forces them
> to be persisted to disk.
Specifically, the timestamps are persisted to stable store when (a)
the file system is unmounted, (b) the inode needs to be pushed
out of memory due to memory pressure, (c) the inode is forcibly
persisted using fsync(), (d) some other inode field is updated
and the inode gets written out, or (e) 24 hours have passed.
As a result, the on-disk timestamps will be at most 24 hours stale.
But this is POSIX compliant, because if you read the timestamps using
stat(1), you will get the updated values, and what happens after a
crash in the absence of an fsync(2) is not defined.
The reason why we implemented this at $WORK is that if you are constantly
updating a database using fdatasync(2), and you care about 99.9th
percentile I/O latency, the 4k writes to the inode table will
eventually trigger a hard drive's Adjacent Track Interference (ATI)
mitigation, which involves rewriting a set of disk tracks to avoid the
analog signal for adjacent tracks getting weakened by the hot-spot
writes, and this is measurable if you are looking at long-tail I/O
latencies. (And yes, we had to talk to our HDD vendors to figure out
this is what was going on, since performance is out of scope of the
SCSI/SATA specifications. Hence, random long-tail ATI latencies to
preserve data integrity are allowed, and in fact, actually a good
thing. :-)
- Ted
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Dave Chinner @ 2025-10-05 23:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
linux-xfs
In-Reply-To: <aOCiCkFUOBWV_1yY@infradead.org>
On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
>
> skipping c/mtime is dangerous. The XFS handle code allows it to
> support HSM where data is migrated out to tape, and requires
> CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope
> for too much as now everyone could skip timestamp updates.
We have already provided a safe method for minimising the overhead
of c/mtime updates in the IO path - it's called lazytime. The
lazytime mount option provides eventual consistency for c/mtime
updates for IO operations instead of immediate consistency.
Timestamps are still updated to have the correct values, but the
latency/performance of the timestamp updates is greatly improved by
holding them purely in memory until some other trigger forces them
to be persisted to disk.
> > It can be used by workloads that want to write a file but don't care
> > much about the preciese timestamp on it and can update it later with
> > utimens() call.
>
> The workload might not care, the rest of the system does. ctime can't
> bet set to arbitrary values, so it is important for backups and as
> an audit trail.
Lazytime works for this use case; a call to utimens() will cause a
persistent update of the timestamps. As will any other inode
modification that has persistence requirements (e.g. block
allocation during IO or other syscalls that modify inode metadata).
> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check return EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
>
> Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.
This came up recently on #xfs w.r.t. lazytime behaviour - we need to
pass the NOWAIT decision semantics down to the filesystem to allow
lazytime to be truly non-blocking. At the moment the high level VFS
NOWAIT checks (via inode_needs_update_time()) have no visibility of
this filesystem specific functionality, so even if we can do the
lazy timestamp update without blocking we still give an -EAGAIN if
IOCB_NOWAIT is set.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Dave Chinner @ 2025-10-05 22:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
linux-xfs
In-Reply-To: <aOCiCkFUOBWV_1yY@infradead.org>
On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
>
> skipping c/mtime is dangerous. The XFS handle code allows it to
> support HSM where data is migrated out to tape, and requires
> CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope
> for too much as now everyone could skip timestamp updates.
>
> > It can be used by workloads that want to write a file but don't care
> > much about the preciese timestamp on it and can update it later with
> > utimens() call.
If you don't care about accurate c/mtime, then mount the filesystem
with '-o lazytime' to degrade c/mtime updates to "eventual
consistency" behaviour for IO operations. If inode metadata is
otherwise modified (e.g. block allocation during IO) or the
application then calls utimens(), it will update the recorded
in-memory timestamps in a persistent manner immediately.
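To confirm that a filesystem really is mounted with lazytime before relying on this behaviour, something like the following could be used. It is only a minimal sketch, assuming the option appears in the fourth field of /proc/mounts; the helper name has_mount_option and the /data mount point are made up for illustration.

```shell
# has_mount_option MOUNTS_FILE MOUNTPOINT OPTION
# Succeeds if the last entry for MOUNTPOINT in MOUNTS_FILE (same layout as
# /proc/mounts) lists OPTION among its comma-separated mount options.
# The function name is hypothetical, not part of any existing tool.
has_mount_option() {
    awk -v mp="$2" -v opt="$3" '
        $2 == mp {
            found = 0
            n = split($4, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == opt)
                    found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Example: /data is an assumed mount point.
if has_mount_option /proc/mounts /data lazytime; then
    echo "lazytime is active on /data"
else
    echo "lazytime is not active on /data (or /data is not a mount point)"
fi
```

Only the last matching entry is considered, since a later mount on the same mount point hides the earlier one.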
> The workload might not care, the rest of the system does. ctime can't
> bet set to arbitrary values, so it is important for backups and as
> an audit trail.
But we can (and do) delay the persistence of IO-based timestamp
updates with the lazytime option.
> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check return EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
>
> Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.
Lazytime updates can generally be done in a non-blocking manner
right now (someone raised that in the context of io-uring on #xfs
about a month ago), but the NOWAIT behaviour for timestamp updates
is done at a higher level in the VFS and does not take into account
filesystem specific non-blocking lazytime updates at all. If we
push the NOWAIT checking behaviour down to the filesystem, we can do
this.
-Dave.
--
Dave Chinner
david@fromorbit.com