From: Pratyush Yadav <pratyush@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com,
changyuanl@google.com, rppt@kernel.org, dmatlack@google.com,
rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com,
ojeda@kernel.org, aliceryhl@google.com, masahiroy@kernel.org,
akpm@linux-foundation.org, tj@kernel.org,
yoann.congal@smile.fr, mmaurer@google.com,
roman.gushchin@linux.dev, chenridong@huawei.com,
axboe@kernel.dk, mark.rutland@arm.com, jannh@google.com,
vincent.guittot@linaro.org, hannes@cmpxchg.org,
dan.j.williams@intel.com, david@redhat.com,
joel.granados@kernel.org, rostedt@goodmis.org,
anna.schumaker@oracle.com, song@kernel.org,
zhangguopeng@kylinos.cn, linux@weissschuh.net,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, gregkh@linuxfoundation.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
rafael@kernel.org, dakr@kernel.org,
bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
myungjoo.ham@samsung.com, yesanishhere@gmail.com,
Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
aleksander.lobakin@intel.com, ira.weiny@intel.com,
andriy.shevchenko@linux.intel.com, leon@kernel.org,
lukas@wunner.de, bhelgaas@google.com, wagi@kernel.org,
djeffery@redhat.com, stuart.w.hayes@gmail.com,
lennart@poettering.net, brauner@kernel.org,
linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org,
saeedm@nvidia.com, ajayachandra@nvidia.com, parav@nvidia.com,
leonro@nvidia.com, witu@nvidia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
Date: Wed, 27 Aug 2025 17:03:55 +0200
Message-ID: <mafs0bjo0yffo.fsf@kernel.org>
In-Reply-To: <20250826162019.GD2130239@nvidia.com>
Hi Jason,
Thanks for the review.
On Tue, Aug 26 2025, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
>> + /*
>> + * Most of the space should be taken by preserved folios. So take its
>> + * size, plus a page for other properties.
>> + */
>> + fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> + if (!fdt) {
>> + err = -ENOMEM;
>> + goto err_unpin;
>> + }
>
> This doesn't seem to have any versioning scheme, it really should..
It does. See the "compatible" property.
static const char memfd_luo_compatible[] = "memfd-v1";
static struct liveupdate_file_handler memfd_luo_handler = {
.ops = &memfd_luo_file_ops,
.compatible = memfd_luo_compatible,
};
This goes into the LUO FDT:
static int luo_files_to_fdt(struct xarray *files_xa_out)
[...]
xa_for_each(files_xa_out, token, h) {
[...]
ret = fdt_property_string(luo_file_fdt_out, "compatible",
h->fh->compatible);
So this handler's callbacks only ever get called for version 1.
>
>> + err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> + (void **)&preserved_folios);
>> + if (err) {
>> + pr_err("Failed to reserve folios property in FDT: %s\n",
>> + fdt_strerror(err));
>> + err = -ENOMEM;
>> + goto err_free_fdt;
>> + }
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fds is just not going to
> work for anything of size.
Yep, I agree. This version already runs into a size limit of around 1 GiB,
because the whole FDT must be one contiguous allocation and MAX_PAGE_ORDER
bounds the largest contiguous chunk folio_alloc() can give us. On top of
that, the FDT format itself is limited to 32-bit sizes. While 4 GiB is very
large, it isn't unreasonable to expect metadata exceeding that for some use
cases (4 GiB is only about 0.4% of 1 TiB, and there are systems a lot larger
than that around).
I think we need something like a luo_xarray data structure that users like
memfd (and later hugetlb, guest_memfd, and maybe others) can build on to
make serialization easier. It would cover both contiguous arrays and arrays
with some holes in them.
I did it this way mainly to keep things simple and get the series out. But
Pasha already mentioned he is running into this limit in some tests, so I
think I will experiment with a serialized xarray design.
>
>> + for (; i < nr_pfolios; i++) {
>> + const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
>> + phys_addr_t phys;
>> + u64 index;
>> + int flags;
>> +
>> + if (!pfolio->foliodesc)
>> + continue;
>> +
>> + phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
>> + folio = kho_restore_folio(phys);
>> + if (!folio) {
>> + pr_err("Unable to restore folio at physical address: %llx\n",
>> + phys);
>> + goto put_file;
>> + }
>> + index = pfolio->index;
>> + flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
>> +
>> + /* Set up the folio for insertion. */
>> + /*
>> + * TODO: Should find a way to unify this and
>> + * shmem_alloc_and_add_folio().
>> + */
>> + __folio_set_locked(folio);
>> + __folio_set_swapbacked(folio);
>>
>> + ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
>> + if (ret) {
>> + pr_err("shmem: failed to charge folio index %d: %d\n",
>> + i, ret);
>> + goto unlock_folio;
>> + }
>
> [..]
>
>> + folio_add_lru(folio);
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + }
>
> Probably some consolidation will be needed to make this less
> duplicated..
Maybe. I do have that as a TODO item, but I took a quick look today and I am
not sure it will make things simple enough. There are a few places that add a
folio to the shmem page cache, all of them with subtle differences, and
consolidating them all might be tricky. Let me give it a shot...
>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely file, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.
As I explained above, the versioning is already there. Beyond that, why do
you think a raw C struct is better than FDT? It is just another way of
expressing the same information. FDT is a bit more cumbersome to write and
read, but comes with the benefit of better introspectability.
>
> Which seems to me to be entirely fine as:
>
> struct memfd_luo_v0 {
> __aligned_u64 size;
> __aligned_u64 pos;
> __aligned_u64 folios;
> };
>
> struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios};
> luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, lets say you compress the folios
> list to optimize holes:
>
> struct memfd_luo_v1 {
> __aligned_u64 size;
> __aligned_u64 pos;
> __aligned_u64 folios_list_with_holes;
> };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
> luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
> luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
I think what you describe here is essentially how LUO works currently; only
the mechanisms are a bit different.
For example, instead of the subsystem calling luo_store_object(), the LUO
core calls back into the subsystem at the appropriate time to let it
populate the object. See memfd_luo_prepare() and its data argument. The
version is decided by the compatible string with which the handler was
registered.
Since LUO knows when to start serializing what, I think this flow of
calling into the subsystem and letting it fill in an object that LUO
tracks and hands over makes a lot of sense.
>
> Then the rule is fairly simple, when the sucessor kernel goes to
> deserialize it asks luo for the versions it supports:
>
> if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
> restore_v1(&memfd_luo_v1)
> else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
> restore_v0(&memfd_luo_v0)
> else
> luo_failure("Do not understand this");
Similarly, on the restore side, the new kernel can register handlers for all
the versions it can deal with, and the LUO core takes care of calling the
right callback. See memfd_luo_retrieve() for an example. If we now have a
v2, the new kernel can simply define a new handler for v2 and add a new
memfd_luo_retrieve_v2().
>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.
This also holds true.
--
Regards,
Pratyush Yadav