From: Pratyush Yadav <pratyush@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	 jasonmiu@google.com,  graf@amazon.com, changyuanl@google.com,
	 rppt@kernel.org,  dmatlack@google.com, rientjes@google.com,
	 corbet@lwn.net,  rdunlap@infradead.org,
	ilpo.jarvinen@linux.intel.com,  kanie@linux.alibaba.com,
	ojeda@kernel.org,  aliceryhl@google.com,  masahiroy@kernel.org,
	akpm@linux-foundation.org,  tj@kernel.org,
	 yoann.congal@smile.fr, mmaurer@google.com,
	 roman.gushchin@linux.dev,  chenridong@huawei.com,
	axboe@kernel.dk,  mark.rutland@arm.com,  jannh@google.com,
	vincent.guittot@linaro.org,  hannes@cmpxchg.org,
	dan.j.williams@intel.com,  david@redhat.com,
	 joel.granados@kernel.org, rostedt@goodmis.org,
	 anna.schumaker@oracle.com,  song@kernel.org,
	zhangguopeng@kylinos.cn,  linux@weissschuh.net,
	linux-kernel@vger.kernel.org,  linux-doc@vger.kernel.org,
	linux-mm@kvack.org,  gregkh@linuxfoundation.org,
	 tglx@linutronix.de, mingo@redhat.com,  bp@alien8.de,
	 dave.hansen@linux.intel.com, x86@kernel.org,  hpa@zytor.com,
	 rafael@kernel.org,  dakr@kernel.org,
	bartosz.golaszewski@linaro.org,  cw00.choi@samsung.com,
	myungjoo.ham@samsung.com,  yesanishhere@gmail.com,
	Jonathan.Cameron@huawei.com,  quic_zijuhu@quicinc.com,
	aleksander.lobakin@intel.com,  ira.weiny@intel.com,
	andriy.shevchenko@linux.intel.com,  leon@kernel.org,
	 lukas@wunner.de, bhelgaas@google.com,  wagi@kernel.org,
	 djeffery@redhat.com, stuart.w.hayes@gmail.com,
	 lennart@poettering.net,  brauner@kernel.org,
	linux-api@vger.kernel.org,  linux-fsdevel@vger.kernel.org,
	saeedm@nvidia.com,  ajayachandra@nvidia.com,  parav@nvidia.com,
	leonro@nvidia.com,  witu@nvidia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
Date: Thu, 04 Sep 2025 14:57:35 +0200
Message-ID: <mafs0a53av0hs.fsf@kernel.org>
In-Reply-To: <20250903150157.GH470103@nvidia.com>

Hi Jason,

On Wed, Sep 03 2025, Jason Gunthorpe wrote:

> On Wed, Sep 03, 2025 at 04:10:37PM +0200, Pratyush Yadav wrote:
>
>> > So, it could be useful, but I wouldn't use it for memfd, the vmalloc
>> > approach is better and we shouldn't optimize for sparseness which
>> > should never happen.
>> 
>> I disagree. I think we are re-inventing the same data format with minor
>> variations. I think we should define extensible fundamental data formats
>> first, and then use those as the building blocks for the rest of our
>> serialization logic.
>
> page, vmalloc, slab seem to me to be the fundamental units of memory
> management in linux, so they should get KHO support.
>
> If you want to preserve a known-sized array you use vmalloc and then
> write out the per-list items. If it is a dictionary/sparse array then
> you write an index with each item too. This is all trivial and doesn't
> really need more abstraction in and of itself, IMHO.

We will use up double the space for tracking metadata, but maybe that is
fine until we start seeing bigger memfds in real workloads.
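
As a rough illustration of that trade-off, here is a minimal userspace
sketch. The struct and function names are hypothetical and are not taken
from the patch series; it only assumes that a dense layout stores one
64-bit value per slot, while an indexed layout stores a 64-bit index next
to each value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only; not from the patch series. */

/* Dense: the position in the array is the page index, only the PFN is stored. */
struct dense_entry {
	uint64_t pfn;
};

/* Indexed: every populated page carries its own index next to the PFN. */
struct sparse_entry {
	uint64_t index;
	uint64_t pfn;
};

/* Metadata bytes for a dense array covering nr_pages slots, holes included. */
static size_t dense_bytes(size_t nr_pages)
{
	return nr_pages * sizeof(struct dense_entry);
}

/* Metadata bytes when only nr_populated pages are written out with an index. */
static size_t sparse_bytes(size_t nr_populated)
{
	return nr_populated * sizeof(struct sparse_entry);
}

/* Nonzero if the indexed layout is strictly smaller for this file. */
static int sparse_is_smaller(size_t nr_pages, size_t nr_populated)
{
	assert(nr_populated <= nr_pages);
	return sparse_bytes(nr_populated) < dense_bytes(nr_pages);
}
```

With these assumed sizes, a fully populated file costs twice as much in
the indexed layout, and the indexed layout only becomes smaller once
fewer than half of the slots are populated.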

>
>> cases can then build on top of it. For example, the preservation bitmaps
>> can get rid of their linked list logic and just use KHO array to hold
>> and retrieve its bitmaps. It will make the serialization simpler.
>
> I don't think the bitmaps should, the serialization here is very
> special because it is not actually preserved, it just exists for the
> time while the new kernel runs in scratch and is insta freed once the
> allocators start up.

I don't think it matters if they are preserved or not. The serialization
and deserialization is independent of that. You can very well create a
KHO array that you don't KHO-preserve. On next boot, you can still use
it, you just have to be careful of doing it while scratch-only. Same as
we do now.

>
>> I also don't get why you think sparseness "should never happen". For
>> memfd for example, you say in one of your other emails that "And again
>> in real systems we expect memfd to be fully populated too." Which
>> systems and use cases do you have in mind? Why do you think people won't
>> want a sparse memfd?
>
> memfd should principally be used to back VM memory, and I expect VM
> memory to be fully populated. Why would it be sparse?

For the _hypervisor_ live update case, sure. Though even there, I have a
feeling we will start seeing userspace components on the hypervisor use
memfd for stashing some of their state. Pasha has already mentioned they
have a use case for a memfd that is not VM memory.

But hypervisor live update isn't the only use case for LUO. We are
looking at enabling state preservation for "normal" userspace
applications. Think big storage nodes with memory on the order of TiB.
Those can use a memfd to back their caches so on a kernel upgrade the
caches don't have to be re-fetched. Sparseness is to be expected for
such use cases.

>
>> All in all, I think KHO array is going to prove useful and will make
>> serialization for subsystems easier. I think sparseness will also prove
>> useful but it is not a hill I want to die on. I am fine with starting
>> with a non-sparse array if people really insist. But I do think we
>> should go with KHO array as a base instead of re-inventing the linked
>> list of pages again and again.
>
> The two main advantages I see to the kho array design vs vmalloc is
> that it should be a bit faster as it doesn't establish a vmap, and it
> handles unknown size lists much better.
>
> Are these important considerations? IDK.
>
> As I said to Chris, I think we should see more examples of what we
> actually need before assuming any certain datastructure is the best
> choice.
>
> So I'd stick to simpler open coded things and go back and improve them
> than start out building the wrong shared data structure.
>
> How about have at least three luo clients that show meaningful benefit
> before proposing something beyond the fundamental page, vmalloc, slab
> things?

I think the fundamentals themselves get some benefit. But since I have
already done most of the work on this feature, I will do the rest and
send the patches out. Then you can have a look, and if you're still not
convinced, I am fine shelving it for now to revisit later when a
stronger case can be made.

>
>> What do you mean by "data per version"? I think there should be only one
>> version of the serialized object. Multiple versions of the same thing
>> will get ugly real quick.
>
> If you want to support backwards/forwards compatibility then you
> probably should support multiple versions as well. Otherwise it
> could become quite hard to make downgrades.

Hmm, forward can work regardless since a newer kernel should speak older
formats too, but for backwards it makes sense to have an older version.

But perhaps it might be a better idea to come up with a mechanism for
the kernel to discover which formats the "next" kernel speaks so it can
for one decide whether it can do the live update at all, and for another
which formats it should use. Maybe we give a way for luod to choose
formats, and give it the responsibility for doing these checks?
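
To sketch what such a check could look like: the snippet below is a
userspace illustration only. The bitmask encoding and the names
pick_format() and negotiate_all() are assumptions made up for this
example; nothing like them exists in the series. It assumes each object
type advertises the format versions it can write, and the next kernel
advertises the versions it can read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one bit per serialization format version, bit N = version N. */
typedef uint32_t fmt_mask_t;

/*
 * Pick the newest format that the running kernel can write and the next
 * kernel can read. Returns the version number, or -1 if the two sides have
 * no format in common.
 */
static int pick_format(fmt_mask_t can_write, fmt_mask_t next_can_read)
{
	fmt_mask_t common = can_write & next_can_read;
	int version = -1;

	/* Shift until the mask is empty; the count is the highest set bit. */
	while (common) {
		version++;
		common >>= 1;
	}
	return version;
}

/*
 * Negotiate a format for each of n object types. Returns 1 and fills
 * chosen[] if every type has a common format, 0 if any type has none,
 * in which case the live update should be refused up front.
 */
static int negotiate_all(const fmt_mask_t *can_write,
			 const fmt_mask_t *next_can_read,
			 int *chosen, int n)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		chosen[i] = pick_format(can_write[i], next_can_read[i]);
		if (chosen[i] < 0)
			return 0;
	}
	return 1;
}
```

Whether this logic would live in the kernel or in luod is exactly the
open question; the sketch only shows that the decision reduces to an
intersection per object type once both sides can describe what they
speak.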

>
> Ideally I'd want to remove the upstream code for obsolete versions
> fairly quickly so I'd imagine kernels will want to generate both
> versions during the transition period and then eventually newer
> kernels will only accept the new version.
>
> I've argued before that the extended matrix of any kernel version to
> any other kernel version should lie with the distro/CSP making the
> kernel fork. They know what their upgrade sequence will be so they can
> manage any missing versions to make it work.
>
> Upstream should do like v6.1 to v6.2 only or something similarly well
> constrained. I think this is a reasonable trade off to get subsystem
> maintainers to even accept this stuff at all.
[...]

-- 
Regards,
Pratyush Yadav
