* [PATCH 0/2] liveupdate: Small FLB fixes
@ 2026-05-28 17:41 David Matlack
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: David Matlack @ 2026-05-28 17:41 UTC (permalink / raw)
To: kexec, linux-kernel
Cc: Andrew Morton, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
David Matlack
This series has 2 small fixes to how FLBs are managed. First is to
increase the outgoing FLB refcount during liveupdate_flb_get_outgoing()
so it cannot be freed while the caller is using it, and to align with
the semantics of liveupdate_flb_get_incoming(). The second is to prevent
FLB retrieve() from being called multiple times if the first attempt
fails.
Both of these changes are needed for the correctness of the PCI core
support for Live Update:
https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
David Matlack (2):
liveupdate: Reference count outgoing FLB data
liveupdate: Remember FLB retrieve() status
include/linux/liveupdate.h | 11 +++++++++--
kernel/liveupdate/luo_flb.c | 20 ++++++++++++++------
2 files changed, 23 insertions(+), 8 deletions(-)
base-commit: 5428435567cbe06c19914592fc22ca23c9ca1de5
--
2.54.0.823.g6e5bcc1fc9-goog
* [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-05-28 17:41 [PATCH 0/2] liveupdate: Small FLB fixes David Matlack
@ 2026-05-28 17:41 ` David Matlack
2026-06-02 17:15 ` Pratyush Yadav
2026-06-03 3:36 ` Pasha Tatashin
2026-05-28 17:41 ` [PATCH 2/2] liveupdate: Remember FLB retrieve() status David Matlack
2026-06-04 5:28 ` [PATCH 0/2] liveupdate: Small FLB fixes Mike Rapoport
2 siblings, 2 replies; 14+ messages in thread
From: David Matlack @ 2026-05-28 17:41 UTC (permalink / raw)
To: kexec, linux-kernel
Cc: Andrew Morton, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
David Matlack
Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
that the FLB structure cannot be freed while the caller is actively
using it. Add an additional liveupdate_flb_put_outgoing() function so
the caller can explicitly indicate when it is done using the outgoing
FLB.
During a Live Update, the kernel may need to fetch the outgoing FLB
outside of the scope of a file handler's preserve() and unpreserve()
callbacks. In that situation there is no way for the caller to protect
itself against the outgoing FLB being freed while it is using it.
Incrementing the reference count in liveupdate_flb_get_outgoing()
ensures it cannot be freed.
This change also aligns the outgoing FLB lifecycle management with the
incoming FLB, since the latter uses the same get/put semantics.
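For illustration only, the get/put pairing described above can be modeled
in plain userspace C. This is a minimal sketch, not the kernel code: struct
flb_state, flb_get() and flb_put() are hypothetical stand-ins, a plain int
replaces refcount_t, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the outgoing half of an FLB; not the kernel struct. */
struct flb_state {
	int count;	/* holders: preserved files plus active get() callers */
	void *obj;	/* live object, NULL once the last holder is gone */
};

/* Take a reference and hand back the object, or fail if nothing is preserved. */
static int flb_get(struct flb_state *s, void **objp)
{
	if (!s->obj)
		return -ENOENT;

	s->count++;
	*objp = s->obj;
	return 0;
}

/* Drop a reference; the object goes away only when the last holder leaves. */
static void flb_put(struct flb_state *s)
{
	assert(s->count > 0);
	if (--s->count == 0)
		s->obj = NULL;	/* stand-in for releasing the object */
}
```

In this model a caller that succeeds in flb_get() keeps the object alive
until its matching flb_put(), even if the last preserved file is dropped in
between.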
Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
Assisted-by: Gemini:gemini-3-pro-preview
Signed-off-by: David Matlack <dmatlack@google.com>
---
include/linux/liveupdate.h | 5 +++++
kernel/liveupdate/luo_flb.c | 10 +++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index 88722e5caf02..c344bf987b63 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -243,6 +243,7 @@ int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
void liveupdate_flb_put_incoming(struct liveupdate_flb *flb);
int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
+void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb);
#else /* CONFIG_LIVEUPDATE */
@@ -292,5 +293,9 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
return -EOPNOTSUPP;
}
+static inline void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb)
+{
+}
+
#endif /* CONFIG_LIVEUPDATE */
#endif /* _LINUX_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
index 8f5c5dd01cd0..7ddef552ff6b 100644
--- a/kernel/liveupdate/luo_flb.c
+++ b/kernel/liveupdate/luo_flb.c
@@ -135,7 +135,7 @@ static int luo_flb_file_preserve_one(struct liveupdate_flb *flb)
return 0;
}
-static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb)
+void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb)
{
struct luo_flb_private *private = luo_flb_get_private(flb);
@@ -266,7 +266,7 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh)
exit_err:
list_for_each_entry_continue_reverse(iter, flb_list, list)
- luo_flb_file_unpreserve_one(iter->flb);
+ liveupdate_flb_put_outgoing(iter->flb);
up_read(&luo_register_rwlock);
return err;
@@ -291,7 +291,7 @@ void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh)
guard(rwsem_read)(&luo_register_rwlock);
list_for_each_entry_reverse(iter, flb_list, list)
- luo_flb_file_unpreserve_one(iter->flb);
+ liveupdate_flb_put_outgoing(iter->flb);
}
/**
@@ -546,6 +546,10 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp)
return -EOPNOTSUPP;
guard(mutex)(&private->outgoing.lock);
+ if (!private->outgoing.obj)
+ return -ENOENT;
+
+ refcount_inc(&private->outgoing.count);
*objp = private->outgoing.obj;
return 0;
--
2.54.0.823.g6e5bcc1fc9-goog
* [PATCH 2/2] liveupdate: Remember FLB retrieve() status
2026-05-28 17:41 [PATCH 0/2] liveupdate: Small FLB fixes David Matlack
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
@ 2026-05-28 17:41 ` David Matlack
2026-06-02 17:18 ` Pratyush Yadav
2026-06-03 3:36 ` Pasha Tatashin
2026-06-04 5:28 ` [PATCH 0/2] liveupdate: Small FLB fixes Mike Rapoport
2 siblings, 2 replies; 14+ messages in thread
From: David Matlack @ 2026-05-28 17:41 UTC (permalink / raw)
To: kexec, linux-kernel
Cc: Andrew Morton, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
David Matlack
LUO keeps track of successful retrieve attempts on an FLB. It does so
to avoid multiple retrievals of the same FLB. Multiple retrievals cause
problems because once the FLB is retrieved, the serialized data
structures are likely freed and the FLB is likely in a very different
state from what the code expects.
All this works well when retrieve succeeds. When it fails,
luo_flb_retrieve_one() returns the error immediately, without ever
storing anywhere that a retrieve was attempted or what its error code
was. If the user attempts to retrieve another file registered with the
same FLB, LUO will attempt to call the FLB's retrieve() callback again.
The retry is problematic for many of the same reasons listed above. The
FLB is likely in a very different state than what the retrieve logic
normally expects (e.g. some KHO pages may have already been restored and
freed).
There is no sane way of attempting the retrieve again. Remember the
error retrieve returned and directly return it on a retry.
This is done by changing the retrieved bool to a retrieve_status
integer. A value of 0 means retrieve was never attempted, a positive
value means it succeeded, and a negative value means it failed and the
error code is the value.
This is similar to commit f85b1c6af5bc ("liveupdate: luo_file: remember
retrieve() status") which did the same for LUO files.
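For illustration only, the three-state status described above can be
modeled in plain userspace C. This is a minimal sketch, not the kernel
code: struct incoming_state, retrieve_once() and the two example callbacks
are hypothetical, and the callback is assumed to return 0 or a negative
errno.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the incoming FLB state; not the kernel struct. */
struct incoming_state {
	int retrieve_status;	/* 0: not attempted, 1: succeeded, <0: failed with that errno */
	int calls;		/* how many times the callback actually ran */
};

/* Run the callback at most once and replay its outcome on later calls. */
static int retrieve_once(struct incoming_state *s, int (*retrieve)(void))
{
	int err;

	if (s->retrieve_status < 0)
		return s->retrieve_status;	/* replay the remembered failure */
	if (s->retrieve_status > 0)
		return 0;			/* already retrieved successfully */

	s->calls++;
	err = retrieve();			/* assumed: 0 or a negative errno */
	s->retrieve_status = err ? err : 1;
	return err;
}

/* Example callback that always fails, to exercise the error path. */
static int failing_retrieve(void)
{
	return -EIO;
}

/* Example callback that always succeeds. */
static int ok_retrieve(void)
{
	return 0;
}
```

Calling retrieve_once() a second time never reaches the callback again,
whether the first attempt failed or succeeded.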
Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
Assisted-by: Gemini:gemini-3-pro-preview
Signed-off-by: David Matlack <dmatlack@google.com>
---
include/linux/liveupdate.h | 6 ++++--
kernel/liveupdate/luo_flb.c | 10 +++++++---
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index c344bf987b63..63ea5417de84 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -173,7 +173,9 @@ struct liveupdate_flb_ops {
* @lock: A mutex that protects all fields within this structure, providing
* the synchronization service for the FLB's ops.
* @finished: True once the FLB's finish() callback has run.
- * @retrieved: True once the FLB's retrieve() callback has run.
+ * @retrieve_status: Status code indicating whether retrieve() has been
+ * attempted. 0 means not attempted, 1 means successful,
+ * and negative value means it failed with that error code.
*/
struct luo_flb_private_state {
refcount_t count;
@@ -181,7 +183,7 @@ struct luo_flb_private_state {
void *obj;
struct mutex lock;
bool finished;
- bool retrieved;
+ int retrieve_status;
};
/*
diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
index 7ddef552ff6b..f8852f7e62e5 100644
--- a/kernel/liveupdate/luo_flb.c
+++ b/kernel/liveupdate/luo_flb.c
@@ -170,7 +170,10 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
if (private->incoming.finished)
return -ENODATA;
- if (private->incoming.retrieved)
+ if (private->incoming.retrieve_status < 0)
+ return private->incoming.retrieve_status;
+
+ if (private->incoming.retrieve_status > 0)
return 0;
if (!fh->active)
@@ -196,12 +199,13 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
err = flb->ops->retrieve(&args);
if (err) {
+ private->incoming.retrieve_status = err;
module_put(flb->ops->owner);
return err;
}
private->incoming.obj = args.obj;
- private->incoming.retrieved = true;
+ private->incoming.retrieve_status = 1;
return 0;
}
@@ -215,7 +219,7 @@ void liveupdate_flb_put_incoming(struct liveupdate_flb *flb)
if (!refcount_dec_and_test(&private->incoming.count))
return;
- if (!private->incoming.retrieved) {
+ if (private->incoming.retrieve_status <= 0) {
int err = luo_flb_retrieve_one(flb);
if (WARN_ON(err))
--
2.54.0.823.g6e5bcc1fc9-goog
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
@ 2026-06-02 17:15 ` Pratyush Yadav
2026-06-02 17:25 ` David Matlack
2026-06-03 3:36 ` Pasha Tatashin
1 sibling, 1 reply; 14+ messages in thread
From: Pratyush Yadav @ 2026-06-02 17:15 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin,
Pratyush Yadav
Hi David,
On Thu, May 28 2026, David Matlack wrote:
> Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
> that the FLB structure cannot be freed while the caller is actively
> using it. Add an additional liveupdate_flb_put_outgoing() function so
> the caller can explicitly indicate when it is done using the outgoing
> FLB.
>
> During a Live Update, the kernel may need to fetch the outgoing FLB
> outside of the scope of a file handler's preserve() and unpreserve()
> callbacks. In that situation there is no way for the caller to protect
> itself against the outgoing FLB being freed while it is using it.
> Incrementing the reference count in liveupdate_flb_get_outgoing()
> ensures it cannot be freed.
We grab a reference to the FLB's module when the first file using the
FLB is preserved. So the FLB should never go away while preserved files
exist. Once all preserved files go away, you normally shouldn't be doing
anything with the FLB anyway.
Can you please elaborate on the use case and why this is a problem?
Using the FLB outside of the standard LUO file callbacks sounds
problematic.
>
> This change also aligns the outgoing FLB lifecycle management with the
> incoming FLB, since the latter uses the same get/put semantics.
>
> Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> Assisted-by: Gemini:gemini-3-pro-preview
> Signed-off-by: David Matlack <dmatlack@google.com>
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH 2/2] liveupdate: Remember FLB retrieve() status
2026-05-28 17:41 ` [PATCH 2/2] liveupdate: Remember FLB retrieve() status David Matlack
@ 2026-06-02 17:18 ` Pratyush Yadav
2026-06-03 3:36 ` Pasha Tatashin
1 sibling, 0 replies; 14+ messages in thread
From: Pratyush Yadav @ 2026-06-02 17:18 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin,
Pratyush Yadav
On Thu, May 28 2026, David Matlack wrote:
> LUO keeps track of successful retrieve attempts on an FLB. It does so
> to avoid multiple retrievals of the same FLB. Multiple retrievals cause
> problems because once the FLB is retrieved, the serialized data
> structures are likely freed and the FLB is likely in a very different
> state from what the code expects.
>
> All this works well when retrieve succeeds. When it fails,
> luo_flb_retrieve_one() returns the error immediately, without ever
> storing anywhere that a retrieve was attempted or what its error code
> was. If the user attempts to retrieve another file registered with the
> same FLB, LUO will attempt to call the FLB's retrieve() callback again.
>
> The retry is problematic for many of the same reasons listed above. The
> FLB is likely in a very different state than what the retrieve logic
> normally expects (e.g. some KHO pages may have already been restored and
> freed).
>
> There is no sane way of attempting the retrieve again. Remember the
> error retrieve returned and directly return it on a retry.
>
> This is done by changing the retrieved bool to a retrieve_status
> integer. A value of 0 means retrieve was never attempted, a positive
> value means it succeeded, and a negative value means it failed and the
> error code is the value.
>
> This is similar to commit f85b1c6af5bc ("liveupdate: luo_file: remember
> retrieve() status") which did the same for LUO files.
>
> Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> Assisted-by: Gemini:gemini-3-pro-preview
> Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Thanks for fixing this!
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-06-02 17:15 ` Pratyush Yadav
@ 2026-06-02 17:25 ` David Matlack
2026-06-08 14:19 ` Pratyush Yadav
0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2026-06-02 17:25 UTC (permalink / raw)
To: Pratyush Yadav
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin
On 2026-06-02 07:15 PM, Pratyush Yadav wrote:
> Hi David,
>
> On Thu, May 28 2026, David Matlack wrote:
>
> > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
> > that the FLB structure cannot be freed while the caller is actively
> > using it. Add an additional liveupdate_flb_put_outgoing() function so
> > the caller can explicitly indicate when it is done using the outgoing
> > FLB.
> >
> > During a Live Update, the kernel may need to fetch the outgoing FLB
> > outside of the scope of a file handler's preserve() and unpreserve()
> > callbacks. In that situation there is no way for the caller to protect
> > itself against the outgoing FLB being freed while it is using it.
> > Incrementing the reference count in liveupdate_flb_get_outgoing()
> > ensures it cannot be freed.
>
> We grab a reference to the FLB's module when the first file using the
> FLB is preserved. So the FLB should never go away while preserved files
> exist. Once all preserved files go away, you normally shouldn't be doing
> anything with the FLB anyway.
>
> Can you please elaborate on the use case and why this is a problem?
> Using the FLB outside of the standard LUO file callbacks sounds
> problematic.
The scenario I had in mind was to remove a PCI device from the outgoing
FLB if the device is forcibly removed while the file is still preserved,
for example someone writes 1 to /sys/bus/pci/devices/.../remove or a
device is physically hot-unplugged.
Specifically this call here from the patch below:
+void pci_liveupdate_cleanup_device(struct pci_dev *dev)
+{
+ /*
+ * It should be safe to READ_ONCE() outside of the rwsem during cleanup
+ * since there should no longer be any references to @dev on the system.
+ */
+ if (READ_ONCE(dev->liveupdate.outgoing)) {
+ pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n");
+ pci_liveupdate_unpreserve(dev);
+ }
+}
https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack@google.com/
I can do this without adding reference counting to
liveupdate_flb_get_outgoing(), but the reference counting makes it
obvious that the outgoing FLB will not be freed while I am using it
here, and also aligns with liveupdate_flb_get_incoming().
> >
> > This change also aligns the outgoing FLB lifecycle management with the
> > incoming FLB, since the latter uses the same get/put semantics.
> >
> > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> > Assisted-by: Gemini:gemini-3-pro-preview
> > Signed-off-by: David Matlack <dmatlack@google.com>
> [...]
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH 2/2] liveupdate: Remember FLB retrieve() status
2026-05-28 17:41 ` [PATCH 2/2] liveupdate: Remember FLB retrieve() status David Matlack
2026-06-02 17:18 ` Pratyush Yadav
@ 2026-06-03 3:36 ` Pasha Tatashin
1 sibling, 0 replies; 14+ messages in thread
From: Pasha Tatashin @ 2026-06-03 3:36 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin,
Pratyush Yadav
On 05-28 17:41, David Matlack wrote:
> LUO keeps track of successful retrieve attempts on an FLB. It does so
> to avoid multiple retrievals of the same FLB. Multiple retrievals cause
> problems because once the FLB is retrieved, the serialized data
> structures are likely freed and the FLB is likely in a very different
> state from what the code expects.
>
> All this works well when retrieve succeeds. When it fails,
> luo_flb_retrieve_one() returns the error immediately, without ever
> storing anywhere that a retrieve was attempted or what its error code
> was. If the user attempts to retrieve another file registered with the
> same FLB, LUO will attempt to call the FLB's retrieve() callback again.
>
> The retry is problematic for many of the same reasons listed above. The
> FLB is likely in a very different state than what the retrieve logic
> normally expects (e.g. some KHO pages may have already been restored and
> freed).
>
> There is no sane way of attempting the retrieve again. Remember the
> error retrieve returned and directly return it on a retry.
>
> This is done by changing the retrieved bool to a retrieve_status
> integer. A value of 0 means retrieve was never attempted, a positive
> value means it succeeded, and a negative value means it failed and the
> error code is the value.
>
> This is similar to commit f85b1c6af5bc ("liveupdate: luo_file: remember
> retrieve() status") which did the same for LUO files.
>
> Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> Assisted-by: Gemini:gemini-3-pro-preview
> Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
> include/linux/liveupdate.h | 6 ++++--
> kernel/liveupdate/luo_flb.c | 10 +++++++---
> 2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index c344bf987b63..63ea5417de84 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -173,7 +173,9 @@ struct liveupdate_flb_ops {
> * @lock: A mutex that protects all fields within this structure, providing
> * the synchronization service for the FLB's ops.
> * @finished: True once the FLB's finish() callback has run.
> - * @retrieved: True once the FLB's retrieve() callback has run.
> + * @retrieve_status: Status code indicating whether retrieve() has been
> + * attempted. 0 means not attempted, 1 means successful,
> + * and negative value means it failed with that error code.
> */
> struct luo_flb_private_state {
> refcount_t count;
> @@ -181,7 +183,7 @@ struct luo_flb_private_state {
> void *obj;
> struct mutex lock;
> bool finished;
> - bool retrieved;
> + int retrieve_status;
> };
>
> /*
> diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
> index 7ddef552ff6b..f8852f7e62e5 100644
> --- a/kernel/liveupdate/luo_flb.c
> +++ b/kernel/liveupdate/luo_flb.c
> @@ -170,7 +170,10 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
> if (private->incoming.finished)
> return -ENODATA;
>
> - if (private->incoming.retrieved)
> + if (private->incoming.retrieve_status < 0)
> + return private->incoming.retrieve_status;
> +
> + if (private->incoming.retrieve_status > 0)
> return 0;
>
> if (!fh->active)
> @@ -196,12 +199,13 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
>
> err = flb->ops->retrieve(&args);
> if (err) {
> + private->incoming.retrieve_status = err;
> module_put(flb->ops->owner);
> return err;
> }
>
> private->incoming.obj = args.obj;
> - private->incoming.retrieved = true;
> + private->incoming.retrieve_status = 1;
>
> return 0;
> }
> @@ -215,7 +219,7 @@ void liveupdate_flb_put_incoming(struct liveupdate_flb *flb)
> if (!refcount_dec_and_test(&private->incoming.count))
> return;
>
> - if (!private->incoming.retrieved) {
> + if (private->incoming.retrieve_status <= 0) {
> int err = luo_flb_retrieve_one(flb);
>
> if (WARN_ON(err))
> --
> 2.54.0.823.g6e5bcc1fc9-goog
>
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
2026-06-02 17:15 ` Pratyush Yadav
@ 2026-06-03 3:36 ` Pasha Tatashin
1 sibling, 0 replies; 14+ messages in thread
From: Pasha Tatashin @ 2026-06-03 3:36 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin,
Pratyush Yadav
On 05-28 17:41, David Matlack wrote:
> Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
> that the FLB structure cannot be freed while the caller is actively
> using it. Add an additional liveupdate_flb_put_outgoing() function so
> the caller can explicitly indicate when it is done using the outgoing
> FLB.
>
> During a Live Update, the kernel may need to fetch the outgoing FLB
> outside of the scope of a file handler's preserve() and unpreserve()
> callbacks. In that situation there is no way for the caller to protect
> itself against the outgoing FLB being freed while it is using it.
> Incrementing the reference count in liveupdate_flb_get_outgoing()
> ensures it cannot be freed.
>
> This change also aligns the outgoing FLB lifecycle management with the
> incoming FLB, since the latter uses the same get/put semantics.
>
> Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> Assisted-by: Gemini:gemini-3-pro-preview
> Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
> include/linux/liveupdate.h | 5 +++++
> kernel/liveupdate/luo_flb.c | 10 +++++++---
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index 88722e5caf02..c344bf987b63 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -243,6 +243,7 @@ int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
> void liveupdate_flb_put_incoming(struct liveupdate_flb *flb);
>
> int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
> +void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb);
>
> #else /* CONFIG_LIVEUPDATE */
>
> @@ -292,5 +293,9 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
> return -EOPNOTSUPP;
> }
>
> +static inline void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb)
> +{
> +}
> +
> #endif /* CONFIG_LIVEUPDATE */
> #endif /* _LINUX_LIVEUPDATE_H */
> diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
> index 8f5c5dd01cd0..7ddef552ff6b 100644
> --- a/kernel/liveupdate/luo_flb.c
> +++ b/kernel/liveupdate/luo_flb.c
> @@ -135,7 +135,7 @@ static int luo_flb_file_preserve_one(struct liveupdate_flb *flb)
> return 0;
> }
>
> -static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb)
> +void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb)
> {
> struct luo_flb_private *private = luo_flb_get_private(flb);
>
> @@ -266,7 +266,7 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh)
>
> exit_err:
> list_for_each_entry_continue_reverse(iter, flb_list, list)
> - luo_flb_file_unpreserve_one(iter->flb);
> + liveupdate_flb_put_outgoing(iter->flb);
> up_read(&luo_register_rwlock);
>
> return err;
> @@ -291,7 +291,7 @@ void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh)
>
> guard(rwsem_read)(&luo_register_rwlock);
> list_for_each_entry_reverse(iter, flb_list, list)
> - luo_flb_file_unpreserve_one(iter->flb);
> + liveupdate_flb_put_outgoing(iter->flb);
> }
>
> /**
> @@ -546,6 +546,10 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp)
> return -EOPNOTSUPP;
>
> guard(mutex)(&private->outgoing.lock);
> + if (!private->outgoing.obj)
> + return -ENOENT;
> +
> + refcount_inc(&private->outgoing.count);
> *objp = private->outgoing.obj;
>
> return 0;
> --
> 2.54.0.823.g6e5bcc1fc9-goog
>
* Re: [PATCH 0/2] liveupdate: Small FLB fixes
2026-05-28 17:41 [PATCH 0/2] liveupdate: Small FLB fixes David Matlack
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
2026-05-28 17:41 ` [PATCH 2/2] liveupdate: Remember FLB retrieve() status David Matlack
@ 2026-06-04 5:28 ` Mike Rapoport
2026-06-05 13:09 ` Pratyush Yadav
2 siblings, 1 reply; 14+ messages in thread
From: Mike Rapoport @ 2026-06-04 5:28 UTC (permalink / raw)
To: David Matlack
Cc: kexec, linux-kernel, Andrew Morton, Pasha Tatashin,
Pratyush Yadav
On Thu, May 28, 2026 at 05:41:38PM +0000, David Matlack wrote:
> This series has 2 small fixes to how FLBs are managed. First is to
> increase the outgoing FLB refcount during liveupdate_flb_get_outgoing()
> so it cannot be freed while the caller is using it, and to align with
> the semantics of liveupdate_flb_get_incoming(). The second is to prevent
> FLB retrieve() from being called multiple times if the first attempt
> fails.
>
> Both of these changes are needed for the correctness of the PCI core
> support for Live Update:
>
> https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
We are late in the release cycle, and since there are no in-tree FLB users,
let's postpone this until after rc1.
> David Matlack (2):
> liveupdate: Reference count outgoing FLB data
> liveupdate: Remember FLB retrieve() status
>
> include/linux/liveupdate.h | 11 +++++++++--
> kernel/liveupdate/luo_flb.c | 20 ++++++++++++++------
> 2 files changed, 23 insertions(+), 8 deletions(-)
>
>
> base-commit: 5428435567cbe06c19914592fc22ca23c9ca1de5
> --
> 2.54.0.823.g6e5bcc1fc9-goog
>
--
Sincerely yours,
Mike.
* Re: [PATCH 0/2] liveupdate: Small FLB fixes
2026-06-04 5:28 ` [PATCH 0/2] liveupdate: Small FLB fixes Mike Rapoport
@ 2026-06-05 13:09 ` Pratyush Yadav
0 siblings, 0 replies; 14+ messages in thread
From: Pratyush Yadav @ 2026-06-05 13:09 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Matlack, kexec, linux-kernel, Andrew Morton, Pasha Tatashin,
Pratyush Yadav
On Thu, Jun 04 2026, Mike Rapoport wrote:
> On Thu, May 28, 2026 at 05:41:38PM +0000, David Matlack wrote:
>> This series has 2 small fixes to how FLBs are managed. First is to
>> increase the outgoing FLB refcount during liveupdate_flb_get_outgoing()
>> so it cannot be freed while the caller is using it, and to align with
>> the semantics of liveupdate_flb_get_incoming(). The second is to prevent
>> FLB retrieve() from being called multiple times if the first attempt
>> fails.
>>
>> Both of these changes are needed for the correctness of the PCI core
>> support for Live Update:
>>
>> https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
>
> We are late in the release cycle, and since there are no in-tree FLB users,
> let's postpone this until after rc1.
Yes, I agree.
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-06-02 17:25 ` David Matlack
@ 2026-06-08 14:19 ` Pratyush Yadav
2026-06-08 23:37 ` David Matlack
0 siblings, 1 reply; 14+ messages in thread
From: Pratyush Yadav @ 2026-06-08 14:19 UTC (permalink / raw)
To: David Matlack
Cc: Pratyush Yadav, kexec, linux-kernel, Andrew Morton, Mike Rapoport,
Pasha Tatashin
On Tue, Jun 02 2026, David Matlack wrote:
> On 2026-06-02 07:15 PM, Pratyush Yadav wrote:
>> Hi David,
>>
>> On Thu, May 28 2026, David Matlack wrote:
>>
>> > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
>> > that the FLB structure cannot be freed while the caller is actively
>> > using it. Add an additional liveupdate_flb_put_outgoing() function so
>> > the caller can explicitly indicate when it is done using the outgoing
>> > FLB.
>> >
>> > During a Live Update, the kernel may need to fetch the outgoing FLB
>> > outside of the scope of a file handler's preserve() and unpreserve()
>> > callbacks. In that situation there is no way for the caller to protect
>> > itself against the outgoing FLB being freed while it is using it.
>> > Incrementing the reference count in liveupdate_flb_get_outgoing()
>> > ensures it cannot be freed.
>>
>> We grab a reference to the FLB's module when the first file using the
>> FLB is preserved. So the FLB should never go away while preserved files
>> exist. Once all preserved files go away, you normally shouldn't be doing
>> anything with the FLB anyway.
>>
>> Can you please elaborate on the use case and why this is a problem?
>> Using the FLB outside of the standard LUO file callbacks sounds
>> problematic.
>
> The scenario I had in mind was to remove a PCI device from the outgoing
> FLB if the device is forcibly removed while the file is still preserved,
> for example someone writes 1 to /sys/bus/pci/devices/.../remove or a
> device is physically hot-unplugged.
>
> Specifically this call here from the patch below:
>
> +void pci_liveupdate_cleanup_device(struct pci_dev *dev)
> +{
> + /*
> + * It should be safe to READ_ONCE() outside of the rwsem during cleanup
> + * since there should no longer be any references to @dev on the system.
> + */
> + if (READ_ONCE(dev->liveupdate.outgoing)) {
> + pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n");
> + pci_liveupdate_unpreserve(dev);
> + }
> +}
>
> https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack@google.com/
>
> I can do this without adding reference counting to
> liveupdate_flb_get_outgoing(), but the reference counting makes it
> obvious that the outgoing FLB will not be freed while I am using it
> here, and also aligns with liveupdate_flb_get_incoming().
The lifecycle of FLB is bound to _preserved_ files. So it is only valid
as long as preserved files exist. So I think you should only get the FLB
object when you are inside a file preservation callback for a file with
which the FLB is registered. Anywhere outside of that, you are not guaranteed
to get anything sane.
This refcounting scheme breaks the inherent "file-lifecycle-bound" part
of FLB, since now anyone can grab a reference and hold the FLB as long
as they like, even when no preserved files exist.
For the normal case, the VFIO driver gets probed and registers its
file handler. Then, when the device is preserved by VFIO, the VFIO file
handler's callbacks can get the FLB and do whatever they need. LUO
guarantees the FLB exists. Anywhere outside of that, you should _not_
touch the FLB for the reasons above.
Now for hot-unplug, I think that case is not supported right now. When a
preserved file exists, LUO can only remove it when the user closes the
session. Trying to clean up the file from any other context will leave
dangling references to the file and we currently do not handle those.
Trying to hold the file reference won't help much either since LUO
callbacks will try to proceed as normal, and normal no longer applies.
For example, say userspace preserved the file for your device in their
session, then you hot-unplug the device, then userspace triggers a
kexec. What is the freeze() callback supposed to do? Sure, the FLB
object still exists, but the device doesn't. Similarly, if you force
remove the module, the freeze() callback itself no longer exists, and
you likely get a panic.
We might at some point support "invalidating" preserved files. I imagine
when you hot-unplug with a preserved device, you tell LUO to invalidate
all preserved files with that device. They would still exist in their
sessions, but all operations on them fail immediately, including
freeze(), which prevents live update from proceeding until user cleans
them up.
So unless I am missing something, I think this refcounting is a band-aid
and the real problem is to properly track these "invalidated" files.
Also, I think the refcounting on the incoming path is also a mistake.
Unfortunately for incoming, there is a need for accessing the FLB
outside of the file handling callbacks, since subsystems needs to use it
to initialize itself. But I suppose we can have a accessor that
subsystems can call once on boot/init to get their object. Then they use
it to initialize their state and refer to the state directly, with all
later calls going through the usual file handler callbacks.
If you are interested in solving this problem, we can have a chat to
talk in more detail, or perhaps have a discussion at one of the
bi-weeklies?
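To make the "invalidating" idea above concrete, here is a minimal userspace sketch. Nothing in it is LUO code: the struct, the function names and the choice of -ENODEV are all assumptions made for illustration. It only shows the intended behaviour, where an invalidated file stays in its session but makes freeze() fail, so live update cannot proceed until the file is cleaned up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a preserved file; not a LUO structure. */
struct preserved_file_model {
	bool invalidated; /* set when the backing device goes away */
};

/* Called on hot-unplug: the file stays in its session but is marked dead. */
static void file_model_invalidate(struct preserved_file_model *f)
{
	f->invalidated = true;
}

/* freeze() refuses to run on an invalidated file (error code is assumed). */
static int file_model_freeze(const struct preserved_file_model *f)
{
	return f->invalidated ? -ENODEV : 0;
}

/* A session freezes only if every file in it freezes. */
static int session_model_freeze(const struct preserved_file_model *files, int n)
{
	for (int i = 0; i < n; i++) {
		int err = file_model_freeze(&files[i]);

		if (err)
			return err;
	}
	return 0;
}
```

In this model a single invalidated file is enough to make the whole session's freeze fail, which is the property that would block the kexec in the hot-unplug example.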
>
>> >
>> > This change also aligns the outgoing FLB lifecycle management with the
>> > incoming FLB, since the latter uses the same get/put semantics.
>> >
>> > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
>> > Assisted-by: Gemini:gemini-3-pro-preview
>> > Signed-off-by: David Matlack <dmatlack@google.com>
>> [...]
>>
>> --
>> Regards,
>> Pratyush Yadav
--
Regards,
Pratyush Yadav
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-06-08 14:19 ` Pratyush Yadav
@ 2026-06-08 23:37 ` David Matlack
2026-06-09 2:17 ` Pasha Tatashin
0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2026-06-08 23:37 UTC (permalink / raw)
To: Pratyush Yadav
Cc: kexec, linux-kernel, Andrew Morton, Mike Rapoport, Pasha Tatashin
On 2026-06-08 04:19 PM, Pratyush Yadav wrote:
> On Tue, Jun 02 2026, David Matlack wrote:
>
> > On 2026-06-02 07:15 PM, Pratyush Yadav wrote:
> >> Hi David,
> >>
> >> On Thu, May 28 2026, David Matlack wrote:
> >>
> >> > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
> >> > that the FLB structure cannot be freed while the caller is actively
> >> > using it. Add an additional liveupdate_flb_put_outgoing() function so
> >> > the caller can explicitly indicate when it is done using the outgoing
> >> > FLB.
> >> >
> >> > During a Live Update, the kernel may need to fetch the outgoing FLB
> >> > outside of the scope of a file handler's preserve() and unpreserve()
> >> > callbacks. In that situation there is no way for the caller to protect
> >> > itself against the outgoing FLB from being freed while it is using it.
> >> > Incrementing the reference count in liveupdate_flb_get_outgoing()
> >> > ensures it cannot be freed.
> >>
> >> We grab a reference to the FLB's module when the first file using the
> >> FLB is preserved. So the FLB should never go away while preserved files
> >> exist. Once all preserved files go away, you normally shouldn't be doing
> >> anything with the FLB anyway.
> >>
> >> Can you please elaborate on the use case and why this is a problem?
> >> Using the FLB outside of the standard LUO file callbacks sounds
> >> problematic.
> >
> > The scenario I had in mind was to remove a PCI device from the outgoing
> > FLB if the device is forcibly removed while the file is still preserved,
> > for example someone writes 1 to /sys/bus/pci/devices/.../remove or a
> > device is physically hot-unplugged.
> >
> > Specifically this call here from the patch below:
> >
> > +void pci_liveupdate_cleanup_device(struct pci_dev *dev)
> > +{
> > + /*
> > + * It should be safe to READ_ONCE() outside of the rwsem during cleanup
> > + * since there should no longer be any references to @dev on the system.
> > + */
> > + if (READ_ONCE(dev->liveupdate.outgoing)) {
> > + pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n");
> > + pci_liveupdate_unpreserve(dev);
> > + }
> > +}
> >
> > https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack@google.com/
> >
> > I can do this without adding reference counting to
> > liveupdate_flb_get_outgoing(), but the reference counting makes it
> > obvious that the outgoing FLB will not be freed while I am using it
> > here, and also aligns with liveupdate_flb_get_incoming().
>
> The lifecycle of FLB is bound to _preserved_ files. So it is only valid
> as long as preserved files exist. So I think you should only get the FLB
> object when you are inside a file preservation callback for a file with
> which the FLB is registered. Anywhere outside of that, you are not guaranteed
> to get anything sane.
LUO should enforce this then, IMO.
> This refcounting scheme breaks the inherent "file-lifecycle-bound" part
> of FLB, since now anyone can grab a reference and hold the FLB as long
> as they like, even when no preserved files exist.
>
> For the normal case, your VFIO driver gets probed, it registers its
> file handler, then when the device is preserved by VFIO, the VFIO file
> handler's callbacks can get the FLB and do whatever. LUO guarantees the
> FLB exists. Anywhere outside of that, you should _not_ touch the FLB
> because of the reasons above.
>
> Now for hot-unplug, I think that case is not supported right now. When a
> preserved file exists, LUO can only remove it when the user closes the
> session. Trying to clean up the file from any other context will leave
> dangling references to the file and we currently do not handle those.
> Trying to hold the file reference won't help much either since LUO
> callbacks will try to proceed as normal, and normal no longer applies.
>
> For example, say userspace preserved the file for your device in their
> session, then you hot-unplug the device, then userspace triggers a
> kexec. What is the freeze() callback supposed to do? Sure, the FLB
> object still exists, but the device doesn't. Similarly, if you force
> remove the module, the freeze() callback itself no longer exists, and
> you likely get a panic.
>
> We might at some point support "invalidating" preserved files. I imagine
> when you hot-unplug with a preserved device, you tell LUO to invalidate
> all preserved files with that device. They would still exist in their
> sessions, but all operations on them fail immediately, including
> freeze(), which prevents live update from proceeding until user cleans
> them up.
>
> So unless I am missing something, I think this refcounting is a band-aid
> and the real problem is to properly track these "invalidated" files.
>
> Also, I think the refcounting on the incoming path is also a mistake.
> Unfortunately for incoming, there is a need for accessing the FLB
> outside of the file handling callbacks, since subsystems need to use it
> to initialize themselves. But I suppose we can have an accessor that
> subsystems can call once on boot/init to get their object. Then they use
> it to initialize their state and refer to the state directly, with all
> later calls going through the usual file handler callbacks.
>
> If you are interested in solving this problem, we can have a chat to
> talk in more detail, or perhaps have a discussion at one of the
> bi-weeklies?
Thanks for the detailed reply but I think it's hard to discuss all these
as theoretical situations since we can get bogged down in the parts that
aren't clear yet and potential future use-cases.

Can you review the use of the outgoing and incoming FLB in the PCI core
series and let me know what you think I am doing wrong?

https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
>
> >
> >> >
> >> > This change also aligns the outgoing FLB lifecycle management with the
> >> > incoming FLB, since the latter uses the same get/put semantics.
> >> >
> >> > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> >> > Assisted-by: Gemini:gemini-3-pro-preview
> >> > Signed-off-by: David Matlack <dmatlack@google.com>
> >> [...]
> >>
> >> --
> >> Regards,
> >> Pratyush Yadav
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-06-08 23:37 ` David Matlack
@ 2026-06-09 2:17 ` Pasha Tatashin
2026-06-09 13:33 ` Pratyush Yadav
0 siblings, 1 reply; 14+ messages in thread
From: Pasha Tatashin @ 2026-06-09 2:17 UTC (permalink / raw)
To: David Matlack
Cc: Pratyush Yadav, kexec, linux-kernel, Andrew Morton, Mike Rapoport,
Pasha Tatashin
On 06-08 23:37, David Matlack wrote:
> On 2026-06-08 04:19 PM, Pratyush Yadav wrote:
> > On Tue, Jun 02 2026, David Matlack wrote:
> >
> > > On 2026-06-02 07:15 PM, Pratyush Yadav wrote:
> > >> Hi David,
> > >>
> > >> On Thu, May 28 2026, David Matlack wrote:
> > >>
> > >> > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
> > >> > that the FLB structure cannot be freed while the caller is actively
> > >> > using it. Add an additional liveupdate_flb_put_outgoing() function so
> > >> > the caller can explicitly indicate when it is done using the outgoing
> > >> > FLB.
> > >> >
> > >> > During a Live Update, the kernel may need to fetch the outgoing FLB
> > >> > outside of the scope of a file handler's preserve() and unpreserve()
> > >> > callbacks. In that situation there is no way for the caller to protect
> > >> > itself against the outgoing FLB from being freed while it is using it.
> > >> > Incrementing the reference count in liveupdate_flb_get_outgoing()
> > >> > ensures it cannot be freed.
> > >>
> > >> We grab a reference to the FLB's module when the first file using the
> > >> FLB is preserved. So the FLB should never go away while preserved files
> > >> exist. Once all preserved files go away, you normally shouldn't be doing
> > >> anything with the FLB anyway.
> > >>
> > >> Can you please elaborate on the use case and why this is a problem?
> > >> Using the FLB outside of the standard LUO file callbacks sounds
> > >> problematic.
> > >
> > > The scenario I had in mind was to remove a PCI device from the outgoing
> > > FLB if the device is forcibly removed while the file is still preserved,
> > > for example someone writes 1 to /sys/bus/pci/devices/.../remove or a
> > > device is physically hot-unplugged.
> > >
> > > Specifically this call here from the patch below:
> > >
> > > +void pci_liveupdate_cleanup_device(struct pci_dev *dev)
> > > +{
> > > + /*
> > > + * It should be safe to READ_ONCE() outside of the rwsem during cleanup
> > > + * since there should no longer be any references to @dev on the system.
> > > + */
> > > + if (READ_ONCE(dev->liveupdate.outgoing)) {
> > > + pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n");
> > > + pci_liveupdate_unpreserve(dev);
> > > + }
> > > +}
> > >
> > > https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack@google.com/
> > >
> > > I can do this without adding reference counting to
> > > liveupdate_flb_get_outgoing(), but the reference counting makes it
> > > obvious that the outgoing FLB will not be freed while I am using it
> > > here, and also aligns with liveupdate_flb_get_incoming().
> >
> > The lifecycle of FLB is bound to _preserved_ files. So it is only valid
An FLB is a shared global resource whose lifecycle is bound to the
duration of first creation and final destruction across multiple files.
It is not associated with any single preserved file, but rather with the
fact that files of a particular type(s) are currently being preserved.

Because the kernel is highly asynchronous, we must handle concurrent
access and race conditions during teardown and FLB lookups. Reference
counting provides a standard, robust, and methodical way to handle this
safely.
> > as long as preserved files exist. So I think you should only get the FLB
> > object when you are inside a file preservation callback for a file with which
Limiting access to only within a file preservation callback is not
feasible. FLBs are global resources and may be queried early during boot
or late during teardown, completely outside of a specific file's
preservation callback.
> > the FLB is registered. Anywhere outside of that, you are not guaranteed
> > to get anything sane.
>
> LUO should enforce this then, IMO.
>
> > This refcounting scheme breaks the inherent "file-lifecycle-bound" part
> > of FLB, since now anyone can grab a reference and hold the FLB as long
> > as they like, even when no preserved files exist.
This is the standard design pattern for shared resources in the kernel.
When the last file of that type is unpreserved, the FLB's reference
count reaches zero, and it is cleanly destroyed.

If there is a bug where someone holds onto a reference indefinitely,
that is simply a leak/bug to be fixed. It is not a flaw in the
reference-counting pattern itself, which is crucial for preventing
use-after-free and race conditions under concurrent access.
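The get/put pattern being discussed can be sketched in plain userspace C as below. This is an illustrative model only: the struct and function names are hypothetical, it is single-threaded, and it is not the luo_flb.c implementation. Each preserved file and each temporary user takes one reference, and the object is torn down on the final put.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an outgoing FLB; not the kernel's struct. */
struct flb_model {
	int refs;       /* preserved files plus temporary users */
	bool destroyed; /* set when the last reference is dropped */
};

/* Take a reference; fails once the object has been destroyed. */
static bool flb_model_get(struct flb_model *flb)
{
	if (flb->destroyed)
		return false;
	flb->refs++;
	return true;
}

/* Drop a reference; the final put tears the object down. */
static void flb_model_put(struct flb_model *flb)
{
	assert(flb->refs > 0);
	if (--flb->refs == 0)
		flb->destroyed = true;
}
```

Note that in this model a temporary user's reference keeps the object alive after the last file's reference is dropped, which is the behaviour both sides of the thread are arguing about.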
> > For the normal case, your VFIO driver gets probed, it registers its
> > file handler, then when the device is preserved by VFIO, the VFIO file
> > handler's callbacks can get the FLB and do whatever. LUO guarantees the
> > FLB exists. Anywhere outside of that, you should _not_ touch the FLB
> > because of the reasons above.
> >
> > Now for hot-unplug, I think that case is not supported right now. When a
> > preserved file exists, LUO can only remove it when the user closes the
> > session. Trying to clean up the file from any other context will leave
> > dangling references to the file and we currently do not handle those.
> > Trying to hold the file reference won't help much either since LUO
> > callbacks will try to proceed as normal, and normal no longer applies.
> >
> > For example, say userspace preserved the file for your device in their
> > session, then you hot-unplug the device, then userspace triggers a
> > kexec. What is the freeze() callback supposed to do? Sure, the FLB
> > object still exists, but the device doesn't. Similarly, if you force
> > remove the module, the freeze() callback itself no longer exists, and
> > you likely get a panic.
> >
> > We might at some point support "invalidating" preserved files. I imagine
> > when you hot-unplug with a preserved device, you tell LUO to invalidate
> > all preserved files with that device. They would still exist in their
> > sessions, but all operations on them fail immediately, including
> > freeze(), which prevents live update from proceeding until user cleans
> > them up.
> >
> > So unless I am missing something, I think this refcounting is a band-aid
> > and the real problem is to properly track these "invalidated" files.
> >
> > Also, I think the refcounting on the incoming path is also a mistake.
On the incoming path, FLBs may be accessed during early boot and later,
including once we enter userspace. We must have a way to ensure valid,
race-free access to these shared resources. Without reference counting,
querying the existence of an FLB or retrieving it would require
introducing global locking, which is highly undesirable.

Also, having both the incoming and outgoing FLB paths behave
symmetrically using the same reference-counting API makes the overall
design uniform, clean, and less error-prone.
> > Unfortunately for incoming, there is a need for accessing the FLB
> > outside of the file handling callbacks, since subsystems need to use it
> > to initialize themselves. But I suppose we can have an accessor that
> > subsystems can call once on boot/init to get their object. Then they use
> > it to initialize their state and refer to the state directly, with all
> > later calls going through the usual file handler callbacks.
> >
> > If you are interested in solving this problem, we can have a chat to
> > talk in more detail, or perhaps have a discussion at one of the
> > bi-weeklies?
>
> Thanks for the detailed reply but I think it's hard to discuss all these
> as theoretical situations since we can get bogged down in the parts that
> aren't clear yet and potential future use-cases.
>
> Can you review the use of the outgoing and incoming FLB in the PCI core
> series and let me know what you think I am doing wrong?
>
> https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
>
> >
> > >
> > >> >
> > >> > This change also aligns the outgoing FLB lifecycle management with the
> > >> > incoming FLB, since the latter uses the same get/put semantics.
> > >> >
> > >> > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
> > >> > Assisted-by: Gemini:gemini-3-pro-preview
> > >> > Signed-off-by: David Matlack <dmatlack@google.com>
> > >> [...]
> > >>
> > >> --
> > >> Regards,
> > >> Pratyush Yadav
> >
> > --
> > Regards,
> > Pratyush Yadav
* Re: [PATCH 1/2] liveupdate: Reference count outgoing FLB data
2026-06-09 2:17 ` Pasha Tatashin
@ 2026-06-09 13:33 ` Pratyush Yadav
0 siblings, 0 replies; 14+ messages in thread
From: Pratyush Yadav @ 2026-06-09 13:33 UTC (permalink / raw)
To: Pasha Tatashin
Cc: David Matlack, Pratyush Yadav, kexec, linux-kernel, Andrew Morton,
Mike Rapoport
On Tue, Jun 09 2026, Pasha Tatashin wrote:
> On 06-08 23:37, David Matlack wrote:
>> On 2026-06-08 04:19 PM, Pratyush Yadav wrote:
>> > On Tue, Jun 02 2026, David Matlack wrote:
>> >
>> > > On 2026-06-02 07:15 PM, Pratyush Yadav wrote:
>> > >> Hi David,
>> > >>
>> > >> On Thu, May 28 2026, David Matlack wrote:
>> > >>
>> > >> > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so
>> > >> > that the FLB structure cannot be freed while the caller is actively
>> > >> > using it. Add an additional liveupdate_flb_put_outgoing() function so
>> > >> > the caller can explicitly indicate when it is done using the outgoing
>> > >> > FLB.
>> > >> >
>> > >> > During a Live Update, the kernel may need to fetch the outgoing FLB
>> > >> > outside of the scope of a file handler's preserve() and unpreserve()
>> > >> > callbacks. In that situation there is no way for the caller to protect
>> > >> > itself against the outgoing FLB from being freed while it is using it.
>> > >> > Incrementing the reference count in liveupdate_flb_get_outgoing()
>> > >> > ensures it cannot be freed.
>> > >>
>> > >> We grab a reference to the FLB's module when the first file using the
>> > >> FLB is preserved. So the FLB should never go away while preserved files
>> > >> exist. Once all preserved files go away, you normally shouldn't be doing
>> > >> anything with the FLB anyway.
>> > >>
>> > >> Can you please elaborate on the use case and why this is a problem?
>> > >> Using the FLB outside of the standard LUO file callbacks sounds
>> > >> problematic.
>> > >
>> > > The scenario I had in mind was to remove a PCI device from the outgoing
>> > > FLB if the device is forcibly removed while the file is still preserved,
>> > > for example someone writes 1 to /sys/bus/pci/devices/.../remove or a
>> > > device is physically hot-unplugged.
>> > >
>> > > Specifically this call here from the patch below:
>> > >
>> > > +void pci_liveupdate_cleanup_device(struct pci_dev *dev)
>> > > +{
>> > > + /*
>> > > + * It should be safe to READ_ONCE() outside of the rwsem during cleanup
>> > > + * since there should no longer be any references to @dev on the system.
>> > > + */
>> > > + if (READ_ONCE(dev->liveupdate.outgoing)) {
>> > > + pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n");
>> > > + pci_liveupdate_unpreserve(dev);
>> > > + }
>> > > +}
>> > >
>> > > https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack@google.com/
>> > >
>> > > I can do this without adding reference counting to
>> > > liveupdate_flb_get_outgoing(), but the reference counting makes it
>> > > obvious that the outgoing FLB will not be freed while I am using it
>> > > here, and also aligns with liveupdate_flb_get_incoming().
>> >
>> > The lifecycle of FLB is bound to _preserved_ files. So it is only valid
>
> An FLB is a shared global resource whose lifecycle is bound to the
> duration of first creation and final destruction across multiple files.
> It is not associated with any single preserved file, but rather with the
> fact that files of a particular type(s) are currently being preserved.
>
> Because the kernel is highly asynchronous, we must handle concurrent
> access and race conditions during teardown and FLB lookups. Reference
> counting provides a standard, robust, and methodical way to handle this
> safely.
But that's my point. As you say, destruction is supposed to happen when
the last file using the FLB goes away. That stops being true when any
random caller can hold a reference. So now you can get an FLB with no
files. It fundamentally breaks the model we started with FLBs.

The FLB model also deals with asynchronicity, but it does so by managing
the lifetime of the FLB directly in LUO instead of leaving it to
subsystems. This reduces cognitive load on the users of the FLB API.
They do not need to care about managing the lifecycle themselves.

If users of FLB are directly managing references, then why does LUO even
need to manage anything? Then subsystems can just create their own
global data structure, grab a reference and release it as they please.
That is a perfectly valid way of managing this global object, but we
decided to _not_ do this because letting LUO manage the lifecycle (and
thus the refcount) results in code that is easier to understand and
reason about.
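As a rough illustration of the LUO-managed model described above, the following userspace sketch ties the shared object's lifetime strictly to the number of preserved files and exposes no get/put to callers. All names are hypothetical and this is not how luo_flb.c is written; it only models "created on first preserve, gone on last unpreserve".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of LUO-owned FLB state; not a kernel structure. */
struct luo_flb_model {
	int nr_files; /* preserved files currently using the FLB */
	int obj;      /* stand-in for the subsystem's shared state */
	int *live;    /* non-NULL only while nr_files > 0 */
};

/* Preserving the first file creates the object; later files share it. */
static int *luo_model_preserve(struct luo_flb_model *m)
{
	if (m->nr_files++ == 0)
		m->live = &m->obj;
	return m->live;
}

/* Unpreserving the last file destroys the object, with no other owners. */
static void luo_model_unpreserve(struct luo_flb_model *m)
{
	assert(m->nr_files > 0);
	if (--m->nr_files == 0)
		m->live = NULL;
}
```

Here the only thing that can extend the object's lifetime is another preserved file, so an FLB with no files cannot exist in this model.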
>
>> > as long as preserved files exist. So I think you should only get the FLB
>> > object when you are inside a file preservation callback for a file with which
>
> Limiting access to only within a file preservation callback is not
> feasible. FLBs are global resources and may be queried early during boot
> or late during teardown, completely outside of a specific file's
> preservation callback.
To clarify, I don't mean only the preservation callback. I mean during
any of the file handler callbacks.

And if the subsystem needs to do something with the FLB after all its
files are frozen, we can provide a freeze() callback to
liveupdate_flb_ops.

Why do you need to access the FLB anywhere else outside of these
callbacks? What's the use case?
>
>> > the FLB is registered. Anywhere outside of that, you are not guaranteed
>> > to get anything sane.
>>
>> LUO should enforce this then, IMO.
>>
>> > This refcounting scheme breaks the inherent "file-lifecycle-bound" part
>> > of FLB, since now anyone can grab a reference and hold the FLB as long
>> > as they like, even when no preserved files exist.
>
> This is the standard design pattern for shared resources in the kernel.
> When the last file of that type is unpreserved, the FLB's reference
> count reaches zero, and it is cleanly destroyed.
>
> If there is a bug where someone holds onto a reference indefinitely,
> that is simply a leak/bug to be fixed. It is not a flaw in the
> reference-counting pattern itself, which is crucial for preventing
> use-after-free and race conditions under concurrent access.
Going to my previous point, the whole reason for FLB to exist is that we
don't want to burden subsystems with managing these references. If we do
that, then I think it would be way easier to just get rid of FLBs
entirely. It would be almost as easy for each subsystem to allocate a
global struct with its state and a refcount.

I don't get why we want to mix these approaches. FLB makes my life as a
user of the API easier because I don't need to worry about the
lifetimes. LUO takes care of that for me.
>
>> > For the normal case, your VFIO driver gets probed, it registers its
>> > file handler, then when the device is preserved by VFIO, the VFIO file
>> > handler's callbacks can get the FLB and do whatever. LUO guarantees the
>> > FLB exists. Anywhere outside of that, you should _not_ touch the FLB
>> > because of the reasons above.
>> >
>> > Now for hot-unplug, I think that case is not supported right now. When a
>> > preserved file exists, LUO can only remove it when the user closes the
>> > session. Trying to clean up the file from any other context will leave
>> > dangling references to the file and we currently do not handle those.
>> > Trying to hold the file reference won't help much either since LUO
>> > callbacks will try to proceed as normal, and normal no longer applies.
>> >
>> > For example, say userspace preserved the file for your device in their
>> > session, then you hot-unplug the device, then userspace triggers a
>> > kexec. What is the freeze() callback supposed to do? Sure, the FLB
>> > object still exists, but the device doesn't. Similarly, if you force
>> > remove the module, the freeze() callback itself no longer exists, and
>> > you likely get a panic.
>> >
>> > We might at some point support "invalidating" preserved files. I imagine
>> > when you hot-unplug with a preserved device, you tell LUO to invalidate
>> > all preserved files with that device. They would still exist in their
>> > sessions, but all operations on them fail immediately, including
>> > freeze(), which prevents live update from proceeding until user cleans
>> > them up.
>> >
>> > So unless I am missing something, I think this refcounting is a band-aid
>> > and the real problem is to properly track these "invalidated" files.
>> >
>> > Also, I think the refcounting on the incoming path is also a mistake.
>
> On the incoming path, FLBs may be accessed during early boot and later,
> including once we enter userspace. We must have a way to ensure valid,
> race-free access to these shared resources. Without reference counting,
> querying the existence of an FLB or retrieving it would require
> introducing global locking, which is highly undesirable.
Ideally, on the incoming path LUO should provide a subsystem init hook
that lets subsystems get their FLB data and use it during init.
Unfortunately it isn't easy because of boot ordering. You might need the
FLB before LUO comes up.

But I think we should still tighten up access to the FLB on the incoming
side as well. You can use the FLB early until it gets registered (or
something similar) with LUO. Then you hand over the ownership to LUO and
can only use it from the standard LUO callbacks. At this point the
access model of incoming and outgoing become the same.

Overall, my point is that we should pick one usage model. Either LUO
manages the lifetime of the object or the subsystem does. Mixing both
this way sounds wrong.
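A minimal sketch of the handover idea, under the assumption that "registered with LUO" can be modelled as a single flag. The names below are made up for illustration and do not correspond to any existing LUO interface. Early boot code may read the incoming data directly until ownership is handed over; after that the early accessor returns nothing and access would go through callbacks only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an incoming FLB; not a kernel structure. */
struct incoming_model {
	bool handed_over; /* true once ownership has moved to LUO */
	int data;         /* stand-in for the restored subsystem state */
};

/* Early accessor: valid only before the handover to LUO. */
static const int *incoming_model_early_get(const struct incoming_model *m)
{
	return m->handed_over ? NULL : &m->data;
}

/* Registration point: from here on only LUO callbacks may touch the data. */
static void incoming_model_hand_over(struct incoming_model *m)
{
	m->handed_over = true;
}
```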
>
> Also, having both the incoming and outgoing FLB paths behave
> symmetrically using the same reference-counting API makes the overall
> design uniform, clean, and less error-prone.
>
>> > Unfortunately for incoming, there is a need for accessing the FLB
>> > outside of the file handling callbacks, since subsystems need to use it
>> > to initialize themselves. But I suppose we can have an accessor that
>> > subsystems can call once on boot/init to get their object. Then they use
>> > it to initialize their state and refer to the state directly, with all
>> > later calls going through the usual file handler callbacks.
>> >
>> > If you are interested in solving this problem, we can have a chat to
>> > talk in more detail, or perhaps have a discussion at one of the
>> > bi-weeklies?
>>
>> Thanks for the detailed reply but I think it's hard to discuss all these
>> as theoretical situations since we can get bogged down in the parts that
>> aren't clear yet and potential future use-cases.
>>
>> Can you review the use of the outgoing and incoming FLB in the PCI core
>> series and let me know what you think I am doing wrong?
>>
>> https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack@google.com/
Yes, good idea. Will do.
--
Regards,
Pratyush Yadav
end of thread, other threads:[~2026-06-09 13:33 UTC | newest]
Thread overview: 14+ messages
2026-05-28 17:41 [PATCH 0/2] liveupdate: Small FLB fixes David Matlack
2026-05-28 17:41 ` [PATCH 1/2] liveupdate: Reference count outgoing FLB data David Matlack
2026-06-02 17:15 ` Pratyush Yadav
2026-06-02 17:25 ` David Matlack
2026-06-08 14:19 ` Pratyush Yadav
2026-06-08 23:37 ` David Matlack
2026-06-09 2:17 ` Pasha Tatashin
2026-06-09 13:33 ` Pratyush Yadav
2026-06-03 3:36 ` Pasha Tatashin
2026-05-28 17:41 ` [PATCH 2/2] liveupdate: Remember FLB retrieve() status David Matlack
2026-06-02 17:18 ` Pratyush Yadav
2026-06-03 3:36 ` Pasha Tatashin
2026-06-04 5:28 ` [PATCH 0/2] liveupdate: Small FLB fixes Mike Rapoport
2026-06-05 13:09 ` Pratyush Yadav