From: Chris Li <chrisl@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com,
changyuanl@google.com, rppt@kernel.org, dmatlack@google.com,
rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com,
ojeda@kernel.org, aliceryhl@google.com, masahiroy@kernel.org,
akpm@linux-foundation.org, tj@kernel.org, yoann.congal@smile.fr,
mmaurer@google.com, roman.gushchin@linux.dev,
chenridong@huawei.com, axboe@kernel.dk, mark.rutland@arm.com,
jannh@google.com, vincent.guittot@linaro.org,
hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com,
joel.granados@kernel.org, rostedt@goodmis.org,
anna.schumaker@oracle.com, song@kernel.org,
zhangguopeng@kylinos.cn, linux@weissschuh.net,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, gregkh@linuxfoundation.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
rafael@kernel.org, dakr@kernel.org,
bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
myungjoo.ham@samsung.com, yesanishhere@gmail.com,
Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
aleksander.lobakin@intel.com, ira.weiny@intel.com,
andriy.shevchenko@linux.intel.com, leon@kernel.org,
lukas@wunner.de, bhelgaas@google.com, wagi@kernel.org,
djeffery@redhat.com, stuart.w.hayes@gmail.com,
ptyadav@amazon.de, lennart@poettering.net, brauner@kernel.org,
linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org,
saeedm@nvidia.com, ajayachandra@nvidia.com, parav@nvidia.com,
leonro@nvidia.com, witu@nvidia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
Date: Wed, 3 Sep 2025 05:01:15 -0700 [thread overview]
Message-ID: <CACePvbWGR+XPfTub41=Ekj3aSMjzyO+FyJmzMy5HEQKq0-wqag@mail.gmail.com> (raw)
In-Reply-To: <20250902134156.GM186519@nvidia.com>
On Tue, Sep 2, 2025 at 6:42 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 29, 2025 at 12:18:43PM -0700, Chris Li wrote:
>
> > Another idea is that having a middle layer manages the life cycle of
> > the reserved memory for you. Kind of like a slab allocator for the
> > preserved memory.
>
> If you want a slab allocator then I think you should make slab
> preservable.. Don't need more allocators :\
Sure, we could reuse the slab allocator and add the KHO function to it. I
consider that an implementation detail that I haven't even started on
yet. I just want to point out that we might want a high-level library
to take care of the life cycle of the preserved memory, so there is
less boilerplate code for the caller.
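To make the "middle layer" idea concrete, here is a minimal userspace sketch of what such a lifecycle manager could look like. All names (`pmem_alloc`, `pmem_preserve_all`, etc.) are invented for illustration and do not exist in the kernel; the point is only the shape of the API: callers allocate through the layer, and at handover time the layer, not each caller, walks its own tracking list and preserves every live object.

```c
#include <stdlib.h>

/* Hypothetical sketch: a layer that tracks preserved-memory objects
 * so individual callers don't have to preserve pages by hand. */
struct pmem_obj {
	void *data;
	size_t size;
	struct pmem_obj *next;	/* layer-internal tracking list */
};

static struct pmem_obj *pmem_list;

/* Allocate an object whose lifetime the layer manages. */
static void *pmem_alloc(size_t size)
{
	struct pmem_obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->data = calloc(1, size);
	o->size = size;
	o->next = pmem_list;
	pmem_list = o;
	return o->data;
}

/* At handover time the layer walks its list and preserves every
 * live object; summing sizes here stands in for the real
 * KHO preserve call. */
static size_t pmem_preserve_all(void)
{
	size_t total = 0;
	struct pmem_obj *o;

	for (o = pmem_list; o; o = o->next)
		total += o->size;	/* stand-in for preservation */
	return total;
}
```

The caller's boilerplate shrinks to a single allocation call; everything else is the layer's job.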
> > Question: Do we have a matching FDT node to match the memfd C
> > structure hierarchy? Otherwise all the C struct will lump into one FDT
> > node. Maybe one FDT node for all C struct is fine. Then there is a
> > risk of overflowing the 4K buffer limit on the FDT node.
>
> I thought you were getting rid of FDT? My suggestion was to be taken
> as a FDT replacement..
Thanks for the clarification. Yes, I do want to get rid of FDT, very much so.
If we are not using FDT, adding an object might change the underlying
C structure layout, causing a chain reaction of C struct changes back
to the root. That is why I assumed you might still be using FDT. I see
your later comments address that with a list of objects; I will
discuss it there.
> You need some kind of hierarchy of identifiers, things like memfd
> should chain off some higher level luo object for a file descriptor.
Ack.
>
> PCI should be the same, but not fd based.
Ack.
> It may be that luo maintains some flat dictionary of
> string -> [object type, version, u64 ptr]*
I see, got it. That answers my question of how to add a new object
without changing the C structure layout: you use a list of the same C
structure, and adding more objects just means adding more items to the
list. This boilerplate detail was not spelled out in your original
suggestion; I understand your proposal better now.
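The flat dictionary described above, mapping a string key to [object type, version, u64 pointer], can be sketched as follows. The names (`luo_dict_entry`, `luo_dict_add`, `luo_dict_find`) are invented for illustration; the point is that registering a new preserved object appends an entry rather than changing any existing struct layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the proposed flat dictionary:
 * string key -> [object type, version, u64 pointer]. */
struct luo_dict_entry {
	const char *key;	/* e.g. "memfd-0" */
	uint32_t type;
	uint32_t version;
	uint64_t phys;		/* u64 pointer to preserved data */
};

#define LUO_DICT_MAX 64
static struct luo_dict_entry luo_dict[LUO_DICT_MAX];
static unsigned int luo_dict_len;

/* Adding a new object kind appends an entry; no existing
 * struct layout changes, so there is no chain reaction. */
static int luo_dict_add(const char *key, uint32_t type,
			uint32_t version, uint64_t phys)
{
	if (luo_dict_len >= LUO_DICT_MAX)
		return -1;
	luo_dict[luo_dict_len].key = key;
	luo_dict[luo_dict_len].type = type;
	luo_dict[luo_dict_len].version = version;
	luo_dict[luo_dict_len].phys = phys;
	luo_dict_len++;
	return 0;
}

static const struct luo_dict_entry *luo_dict_find(const char *key)
{
	unsigned int i;

	for (i = 0; i < luo_dict_len; i++)
		if (!strcmp(luo_dict[i].key, key))
			return &luo_dict[i];
	return NULL;
}
```

A fixed-size array keeps the sketch short; a real implementation would grow dynamically and carry per-type version negotiation.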
> And if you want to serialize that the optimal path would be to have a
> vmalloc of all the strings and a vmalloc of the [] data, sort of like
> the kho array idea.
Is the KHO array idea already implemented in the existing KHO code, or
is it something new you want to propose?
Then we would have to know the combined size of the strings up front,
similar to the FDT story. Ideally the list could grow incrementally:
items might first be stored as a list of raw pointers without vmalloc,
and then a final pass would vmalloc and serialize the strings and
data.
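The two-pass idea just described, collect items incrementally with no size known up front, then size, allocate once, and pack, can be sketched like this. The names are illustrative, and `malloc` stands in for the kernel's `vmalloc`:

```c
#include <stdlib.h>
#include <string.h>

/* Pass 1: items accumulate on a list of raw pointers; no combined
 * size needs to be known yet. */
struct item {
	const char *s;
	struct item *next;
};

static struct item *items;

static void item_add(const char *s)
{
	struct item *it = malloc(sizeof(*it));

	it->s = s;
	it->next = items;	/* prepend: newest first */
	items = it;
}

/* Pass 2 (final pass): compute the combined size, allocate one
 * buffer (vmalloc in the kernel, malloc here), and pack the
 * strings back to back, NUL-separated. */
static char *items_pack(size_t *out_len)
{
	size_t len = 0;
	struct item *it;
	char *buf, *p;

	for (it = items; it; it = it->next)
		len += strlen(it->s) + 1;
	buf = malloc(len ? len : 1);
	p = buf;
	for (it = items; it; it = it->next) {
		size_t n = strlen(it->s) + 1;

		memcpy(p, it->s, n);
		p += n;
	}
	*out_len = len;
	return buf;
}
```

This avoids the FDT problem of needing the total size before the first item is added, at the cost of walking the list twice.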
With the additional detail above, I would like to point out something
I observed earlier: even though the core idea of the native C struct
is simple and intuitive, the end-to-end implementation is not. When we
compare the C struct implementation, we need to include all those
additional boilerplate details as a whole; otherwise it is not an
apples-to-apples comparison.
> > At this stage, do you see that exploring such a machine idea can be
> > beneficial or harmful to the project? If such an idea is considered
> > harmful, we should stop discussing such an idea at all. Go back to
> > building more batches of hand crafted screws, which are waiting by the
> > next critical component.
>
> I haven't heard a compelling idea that will obviously make things
> better.. Adding more layers and complexity is not better.
Yes, I completely understand how you reason about it, and I agree with
your assessment.
I would like to add that you have been heavily discounting the
boilerplate in the C struct solution. Here is where our viewpoints
might differ: if the "more layers" have counterparts in the C struct
solution as well, then they are not "more"; they are the necessary
evil. We need to compare apples to apples.
> Your BTF proposal doesn't seem to benifit memfd at all, it was focused
> on extracting data directly from an existing struct which I feel very
> strongly we should never do.
From a data flow point of view, the data is read from a C struct and
eventually stored into a C struct. There is no way around that; it is
the necessary evil if you automate this process. And there is also no
rule saying you can't use a bounce buffer or some kind of manual
control in between.
It is just a way to automate things to reduce the boilerplate. We can
put different labels on it and declare this label or that concept bad,
but your C struct approach does the exact same thing: pulling data
from a C struct and storing it into a C struct. We are only arguing
about labels, this one good, that one bad, while underneath they share
the same necessary evil.
> The above dictionary, I also don't see how BTF helps. It is such a
> special encoding. Yes you could make some elaborate serialization
> infrastructure, like FDT, but we have all been saying FDT is too hard
> to use and too much code. I'm not sure I'm convinced there is really a
Are you ready to be convinced? If you treat this as a religion, you
can never be convinced.
The reason FDT is too hard to use lies elsewhere: FDT was designed to
be constructed by offline tools and to be mostly read-only in the
kernel. We are using FDT outside its original design parameters. That
does not mean something (the machine) specially designed for this
purpose can't be built and be easier to use.
> better middle ground :\
With due respect, it sounds like you run the risk of judging something
you haven't fully understood. I feel that a baby, my baby, has been
thrown out with the bathwater.
As a test of the waters for the above statement: can you describe my
idea as well as or better than I do, so that it passes the test of my
saying "yes, this is exactly what I am trying to build"?
That is the communication barrier I am talking about. I estimate at
this rate it will take us about 15 email exchanges to get to the core
stuff. It might be much quicker to lock you and me in a room and only
release us when each of us can describe the other's viewpoint at a
mutually satisfactory level. I understand your time is precious, and I
don't want to waste it. I fully respect and will comply with your
decision. If you want me to stop now, I can stop, no questions asked.
That gets back to my original question: do we already have a ruling
that even discussing "the machine" idea is forbidden?
> IMHO if there is some way to improve this it still yet to be found,
In my mind, I have found it. I just have to get over the communication
barrier to plead my case to you. You can issue a preliminary ruling to
dismiss my case; I just wish you fully understood the facts of the
case before making such a ruling.
> and I think we don't well understand what we need to serialize just
> yet.
That may be true; we don't have a 100% understanding of what needs to
be serialized. On the other hand, it is not 0% either. Based on what
we do understand, we can already use "the machine" to do what we know
much more effectively. Of course, there is a trade-off in developing
"the machine": it takes extra time, and there is complexity in
maintaining it. I fully understand that.
> Smaller ideas like preserve the vmalloc will make big improvement
> already.
Yes, I totally agree. It is a local optimization we can do, though it
might not be the global optimum. "The machine" might not use vmalloc
at all, and all these small incremental changes would be thrown away
once we have "the machine".
To put this in terms of the airplane story: yes, we can build
diamond-plated files to produce the hand-crafted screws faster. The
missed opportunity is that if we had "the machine" earlier, we could
pump out machined screws much faster at scale; even minus the time to
build the machine, it might still be an overall win. We wouldn't need
diamond-plated files at all if we had the machine.
> Lets not race ahead until we understand the actual problem properly.
Is that the final ruling? It feels like it. I am just clarifying what
I am receiving.
I feel a much stronger sense of urgency than you do, though. The
stakes are high: you already have four subsystems that could use this
common serialization library right now:
1) PCI
2) VFIO
3) IOMMU
4) memfd
We are getting into more complex data structures. If we merge this
into the mainline, it will be much harder to pull them out later;
basically, it becomes a done deal. That is why I am putting my
reputation and my job on the line to pitch "the machine" idea. It is a
very risky move, and I fully understand that.
Chris