From: Pratyush Yadav <ptyadav@amazon.de>
To: Alexander Graf <graf@amazon.com>
Cc: <linux-kernel@vger.kernel.org>,
	<linux-trace-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<devicetree@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<kexec@lists.infradead.org>, <linux-doc@vger.kernel.org>,
	<x86@kernel.org>, Eric Biederman <ebiederm@xmission.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Ashish Kalra <ashish.kalra@amd.com>,
	James Gowans <jgowans@amazon.com>,
	Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>,
	<arnd@arndb.de>, <pbonzini@redhat.com>,
	<madvenka@linux.microsoft.com>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	Usama Arif <usama.arif@bytedance.com>,
	"David Woodhouse" <dwmw@amazon.co.uk>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Rob Herring <robh+dt@kernel.org>,
	"Krzysztof Kozlowski" <krzk@kernel.org>
Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers
Date: Fri, 16 Feb 2024 16:29:43 +0100
Message-ID: <mafs0zfw08m1k.fsf@amazon.de>
In-Reply-To: <20240117144704.602-1-graf@amazon.com> (Alexander Graf's message of "Wed, 17 Jan 2024 14:46:47 +0000")

Hi Alex,

On Wed, Jan 17 2024, Alexander Graf wrote:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:

I am working on handing userspace pages across kexec. This is useful
for applications with large in-memory state that is time-consuming to
rebuild. If they can hand their state over across kexec, kernel
upgrades can happen with lower downtime. As part of this, I have been
looking at plugging all of this into CRIU [0] so that I don't have to
modify the applications to use this feature. I can just use CRIU to do
the checkpoint and restore quickly over kexec.

I hacked together some patches for this (not yet polished enough to
publish) and ended up implementing something like KHO in a much cruder
way. I have since refactored my patches to use KHO and I find it quite
useful. So thanks for working on this :-)

It was easy enough to get KHO working with my patches, though I had to
look into your ftrace patches to get the whole picture. The
documentation could be improved to show how KHO is used from the
driver/subsystem perspective. For example, I had to read your ftrace
patches to figure out that I should use kho_get_fdt(), or that I should
register a notifier via kho_register_notifier(). I would be happy to
contribute some documentation improvements.

Have you done any analysis of the performance or memory overhead? If
so, it would be nice to look at some data. I have some concerns about
both, especially for more fragmented memory, but I don't yet have
numbers to present.

[0] https://github.com/checkpoint-restore/criu

>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line

FYI, you forgot to paste the links in v3, but I can find them in v2.

Of all these options, PKRAM seems somewhat useful for my use case, but
with CRIU it would need to copy all the application pages to the PKRAM
FS and would need at least as much free memory as application memory.

Instead, I have built a simple system that gives userspace an API to
hand over its pages and to request them back. It keeps track of the
PID and the PA -> VA mappings (essentially a page table). This lets me
keep the pages in place and avoids needing lots of free memory or
expensive copying. KHO plays a crucial role there in handing those
pages and page tables across to the next kernel.
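
To make the bookkeeping described above concrete, here is a minimal
userspace sketch. It is not taken from the unpublished patches: the
names handover_table, handover_page() and reclaim_page(), the fixed
table size, and the flat array (instead of a real page table) are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity, chosen only to keep the sketch simple. */
#define MAX_HANDED_PAGES 1024

/* One handed-over page: which process owned it and where it was mapped. */
struct page_mapping {
	int32_t pid;	/* owning process */
	uint64_t va;	/* virtual address in that process */
	uint64_t pa;	/* physical address of the page, left in place */
};

/* Flat stand-in for the "essentially a page table" structure. */
struct handover_table {
	struct page_mapping map[MAX_HANDED_PAGES];
	size_t count;
};

/* Record a page as handed over. Returns 0 on success, -1 if the table is full. */
int handover_page(struct handover_table *t, int32_t pid,
		  uint64_t va, uint64_t pa)
{
	assert(t != NULL);
	if (t->count >= MAX_HANDED_PAGES)
		return -1;
	t->map[t->count++] =
		(struct page_mapping){ .pid = pid, .va = va, .pa = pa };
	return 0;
}

/*
 * Request a page back: find the entry for (pid, va), store its physical
 * address in *pa and drop the entry. Returns 0 on success, -1 if the
 * mapping is unknown.
 */
int reclaim_page(struct handover_table *t, int32_t pid,
		 uint64_t va, uint64_t *pa)
{
	assert(t != NULL && pa != NULL);
	for (size_t i = 0; i < t->count; i++) {
		if (t->map[i].pid == pid && t->map[i].va == va) {
			*pa = t->map[i].pa;
			/* Fill the hole with the last entry. */
			t->map[i] = t->map[--t->count];
			return 0;
		}
	}
	return -1;
}
```

The point of the sketch is that only the small mapping records are
stored and looked up; the page contents themselves are never copied.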

The FDT format works fairly well for my use case. Since page tables are
a stable data structure, I don't need to worry about their format
changing between kernel versions and can pass them through directly.
This might not be true for many other data structures, so subsystems
using those need to either serialize them to FDT or invent their own
serialization formats.

I also wonder how the "mem" array will work for more fragmented
allocations. It might grow very large with lots of scattered elements. I
wonder how both KHO's parsing and memblock will behave in this case. I
have not stress-tested it yet, so I can't say from experience.
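
The fragmentation concern can be illustrated with a small userspace
sketch. It assumes, purely for illustration, that each "mem" entry
describes one physically contiguous range; struct mem_range and
coalesce_pfns() are hypothetical names and not part of KHO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one physically contiguous run of pages. */
struct mem_range {
	uint64_t start_pfn;	/* first page frame number of the run */
	uint64_t nr_pages;	/* number of contiguous pages in the run */
};

/*
 * Merge a sorted, duplicate-free list of page frame numbers into
 * contiguous ranges. Returns the number of ranges required; at most
 * max_out of them are written to out.
 */
size_t coalesce_pfns(const uint64_t *pfns, size_t n,
		     struct mem_range *out, size_t max_out)
{
	size_t nr_ranges = 0;

	assert(out != NULL || max_out == 0);

	for (size_t i = 0; i < n; i++) {
		if (i > 0 && pfns[i] == pfns[i - 1] + 1) {
			/* Adjacent to the previous page: extend the current range. */
			if (nr_ranges <= max_out)
				out[nr_ranges - 1].nr_pages++;
			continue;
		}
		/* Gap (or first page): a new range is needed. */
		if (nr_ranges < max_out) {
			out[nr_ranges].start_pfn = pfns[i];
			out[nr_ranges].nr_pages = 1;
		}
		nr_ranges++;
	}
	return nr_ranges;
}
```

Under that assumption, four contiguous pages collapse into a single
entry, while four pages that each have a gap between them need four
entries. In the worst case the array grows linearly with the number of
preserved pages.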
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve.  The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
>
> == How to Use ==
>
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>
> Make sure to fill ftrace with contents that you want to observe after
> kexec.  Then, before you invoke file based "kexec -l", activate KHO:
>
>   # echo 1 > /sys/kernel/kho/active
>   # kexec -l Image --initrd=initrd -s
>   # kexec -e
>
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.
>
[...]

Overall, I think KHO is quite useful and I would be happy to see it
evolve and eventually make it into the kernel. It would certainly make
my life a lot easier.

Since I have used it in my patches, I have done some basic testing for
it. Nothing fancy, just handed a few pages across. It works as
advertised. For that,

Tested-by: Pratyush Yadav <ptyadav@amazon.de>

-- 
Regards,
Pratyush Yadav
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879