From: Changyuan Lyu <changyuanl@google.com>
To: jgg@nvidia.com
Cc: akpm@linux-foundation.org, anthony.yznaga@oracle.com,
	arnd@arndb.de,  ashish.kalra@amd.com, benh@kernel.crashing.org,
	bp@alien8.de,  catalin.marinas@arm.com, changyuanl@google.com,
	corbet@lwn.net,  dave.hansen@linux.intel.com,
	devicetree@vger.kernel.org, dwmw2@infradead.org,
	 ebiederm@xmission.com, graf@amazon.com, hpa@zytor.com,
	jgowans@amazon.com,  kexec@lists.infradead.org, krzk@kernel.org,
	 linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	luto@kernel.org,  mark.rutland@arm.com, mingo@redhat.com,
	pasha.tatashin@soleen.com,  pbonzini@redhat.com,
	peterz@infradead.org, ptyadav@amazon.de,  robh+dt@kernel.org,
	robh@kernel.org, rostedt@goodmis.org, rppt@kernel.org,
	 saravanak@google.com, skinsburskii@linux.microsoft.com,
	tglx@linutronix.de,  thomas.lendacky@amd.com, will@kernel.org,
	x86@kernel.org
Subject: Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers
Date: Mon, 24 Mar 2025 17:21:45 -0700
Message-ID: <20250325002145.982402-1-changyuanl@google.com>
In-Reply-To: <Z+GIRecXeYXiPrYv@nvidia.com>

Hi Jason,

On Mon, Mar 24, 2025 at 13:28:53 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> [...]
> > > I feel like this patch is premature, it should come later in the
> > > project along with a stronger justification for this approach.
> > >
> > > IMHO keep things simple for this series, just the very basics.
> >
> > The main purpose of using hashtables is to enable KHO users to save
> > data to KHO at any time, not just at the time of activate/finalize KHO
> > through sysfs/debugfs. For example, FDBox can save the data into KHO
> > tree once a new fd is saved to KHO. Also, using hashtables allows KHO
> > users to add data to KHO concurrently, while with notifiers, KHO users'
> > callbacks are executed serially.
>
> This is why I like the recursive FDT scheme. Each serialization
> operation can open its own FDT, write to it, and then close it
> sequentially within its operation without any worries about
> concurrency.
>
> The top level just aggregates the FDT blobs (which are in preserved
> memory)
>
> To me all this complexity here with the hash table and the copying
> makes no sense compared to that. It is all around slower.
>
> > Regarding the suggestion of recursive FDT, I feel like it is already
> > doable with this patchset, or even with Mike's V4 patch.
>
> Of course it is doable, here we are really talking about what is the
> right, recommended way to use this system. Recursive FDT is a better
> methodology than hash tables.
>
> > just allocates a buffer, serialize all its states to the buffer using
> > libfdt (or even using other binary formats), save the address of the
> > buffer to KHO's tree, and finally register the buffer's underlying
> > pages/folios with kho_preserve_folio().
>
> Yes, exactly! I think this is how we should operate this system as a
> paradigm, not a giant FDT, hash table and so on...
>
> [...]
> > To completely remove fdt_max, I am considering the idea in [1]. At the
> > time of kexec_file_load(), we pass the address of an anchor page to
> > the new kernel, and the anchor page will later be filled with the
> > physical addresses of the pages containing the FDT blob. Multiple
> > anchor pages can be linked together. The FDT blob pages can be physically
> > noncontiguous.
>
> Yes, this is basically what I suggested too. I think this is much
> preferred and doesn't require the wacky uapi.
>
> Except I suggested you just really need a single u64 to point to a
> preserved page holding the top level FDT.
>
> With recursive FDT I think we can say that no FDT fragment should
> exceed PAGE_SIZE, and things become much simpler, IMHO.

Thanks for the suggestions! I am a little bit concerned about assuming
every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
larger than PAGE_SIZE, I would like to turn the single u64 in the parent
FDT into a u64 list to record all the underlying pages of the child FDT.
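
As a rough userspace sketch of what "a u64 list" could mean on the
writing side (kho_encode_page_list() and put_cell_be32() are hypothetical
names for illustration, not functions from this series), each page
address becomes two big-endian 32-bit cells, so a one-page child FDT
still yields a single u64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value as one big-endian FDT cell. */
static void put_cell_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Hypothetical helper (an assumption, not part of the patch set):
 * encode n page addresses as 2 * n cells in buf.
 * Returns the property length in bytes, or -1 if the list is empty
 * or does not fit in buf.
 */
static int kho_encode_page_list(const uint64_t *pages, size_t n,
				uint8_t *buf, size_t buf_len)
{
	size_t i;

	assert(pages != NULL && buf != NULL);
	if (n == 0 || n > buf_len / 8)
		return -1;

	for (i = 0; i < n; i++) {
		/* High cell first, then low cell. */
		put_cell_be32(buf + 8 * i, (uint32_t)(pages[i] >> 32));
		put_cell_be32(buf + 8 * i + 4, (uint32_t)pages[i]);
	}
	return (int)(n * 8);
}
```

The resulting buffer and length could then be handed to something like
libfdt's fdt_setprop() when the parent node is written.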

To be concrete and make sure I understand your suggestions correctly,
I drafted the following design,

Suppose we have 2 KHO users, memblock and gpu@0x2000000000, the KHO
FDT (top level FDT) would look like the following,

    /dts-v1/;
    / {
            compatible = "kho-v1";
            memblock {
                    kho,recursive-fdt = <0x00 0x40001000>;
            };
            gpu@0x2000000000 {
                    kho,recursive-fdt = <0x00 0x40002000>;
            };
    };

kho,recursive-fdt in "memblock" points to a page containing another
FDT,

    / {
            compatible = "memblock-v1";
            n1 {
                    compatible = "reserve-mem-v1";
                    size = <0x04 0x00>;
                    start = <0xc06b 0x4000000>;
            };
            n2 {
                    compatible = "reserve-mem-v1";
                    size = <0x04 0x00>;
                    start = <0xc067 0x4000000>;
            };
    };

Similarly, "kho,recursive-fdt" in "gpu@0x2000000000" points to a page
containing another FDT,

    / {
            compatible = "gpu-v1";
            key1 = "v1";
            key2 = "v2";

            node1 {
                    kho,recursive-fdt = <0x00 0x40003000 0x00 0x40005000>;
            };
            node2 {
                    key3 = "v3";
                    key4 = "v4";
            };
    };

and kho,recursive-fdt in "node1" contains 2 non-contiguous pages backing
the following large FDT fragment,

    / {
            compatible = "gpu-subnode1-v1";

            key5 = "v5";
            key6 = "v6";
            key7 = "v7";
            key8 = "v8";
            ... // many many keys and small values
    };

This way, we assume that most FDT fragments are smaller than one page, so
"kho,recursive-fdt" is usually just a single u64, but we can still handle
larger fragments if they really occur.
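
To illustrate how a reader of "kho,recursive-fdt" could treat the
one-page and multi-page cases uniformly, here is a minimal userspace
sketch. kho_decode_recursive_fdt() and cell_be32() are hypothetical
names, and the property layout (pairs of big-endian 32-bit cells, one
pair per page) is only my assumption based on the examples above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one big-endian 32-bit FDT cell. */
static uint32_t cell_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper (an assumption, not part of the patch set):
 * decode a "kho,recursive-fdt" property value of len bytes into
 * physical page addresses.
 * Returns the number of pages, or -1 if len is zero, is not a
 * multiple of 8, or describes more than max_pages pages.
 */
static int kho_decode_recursive_fdt(const void *prop, size_t len,
				    uint64_t *pages, size_t max_pages)
{
	const uint8_t *p = prop;
	size_t n = len / 8, i;

	assert(prop != NULL && pages != NULL);
	if (len == 0 || len % 8 != 0 || n > max_pages)
		return -1;

	for (i = 0; i < n; i++) {
		uint64_t hi = cell_be32(p + 8 * i);
		uint64_t lo = cell_be32(p + 8 * i + 4);

		pages[i] = (hi << 32) | lo;
	}
	return (int)n;
}
```

A return value of 1 is the common single-page case; anything larger
means the child FDT has to be reassembled (or mapped contiguously) from
the listed pages before it is parsed. The prop and len arguments are
what libfdt's fdt_getprop() would return for the property.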

I also allow KHO users to add sub nodes in place, instead of forcing
them to create a new FDT fragment for every sub node, if the KHO user is
confident that those sub nodes are small enough to fit in the parent
node's page. This way we do not need to waste a full page on a small
sub node. An example is the "memblock" node above.

Finally, the KHO top level FDT may also be larger than 1 page. This can
be handled using the anchor-page method discussed in the previous mails.
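
For reference, one possible shape of the anchor-page walk, as a
userspace sketch only. The layout of struct kho_anchor_page (a next
pointer, a count, then page addresses), the 4 KiB page size, and the
pool-based stand-in for a physical-to-virtual lookup are all my
assumptions for illustration; none of them is defined by this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KHO_PAGE_SIZE    4096u	/* assumed page size */
#define KHO_ANCHOR_SLOTS (KHO_PAGE_SIZE / sizeof(uint64_t) - 2)

/* Hypothetical anchor page layout, exactly one page in size. */
struct kho_anchor_page {
	uint64_t next;	/* physical address of next anchor page, 0 if last */
	uint64_t count;	/* number of valid entries in pages[] */
	uint64_t pages[KHO_ANCHOR_SLOTS];	/* FDT blob page addresses */
};

static_assert(sizeof(struct kho_anchor_page) == KHO_PAGE_SIZE,
	      "anchor page must fill exactly one page");

/* Stand-in for phys_to_virt(): pool[i] lives at pool_phys + i pages. */
static const struct kho_anchor_page *
anchor_at(const struct kho_anchor_page *pool, size_t pool_pages,
	  uint64_t pool_phys, uint64_t phys)
{
	uint64_t idx;

	if (phys < pool_phys || (phys - pool_phys) % KHO_PAGE_SIZE)
		return NULL;
	idx = (phys - pool_phys) / KHO_PAGE_SIZE;
	return idx < pool_pages ? &pool[idx] : NULL;
}

/*
 * Walk the chain starting at physical address first and copy every
 * FDT page address into out. Returns the number of addresses, or -1
 * on a bad link, a bad count, a cycle, or if out is too small.
 */
static long kho_anchor_collect(const struct kho_anchor_page *pool,
			       size_t pool_pages, uint64_t pool_phys,
			       uint64_t first, uint64_t *out, size_t max)
{
	size_t n = 0, visited = 0;
	uint64_t phys = first, i;

	while (phys) {
		const struct kho_anchor_page *a =
			anchor_at(pool, pool_pages, pool_phys, phys);

		/* More visits than pool pages means the chain loops. */
		if (!a || a->count > KHO_ANCHOR_SLOTS || ++visited > pool_pages)
			return -1;
		for (i = 0; i < a->count; i++) {
			if (n == max)
				return -1;
			out[n++] = a->pages[i];
		}
		phys = a->next;
	}
	return (long)n;
}
```

With this layout only the address of the first anchor page would need
to be passed to the new kernel at kexec_file_load() time, and the FDT
blob pages themselves can stay physically non-contiguous.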

What do you think?

Best,
Changyuan

