Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)


From: Alexander Graf <graf@amazon.de>
To: <madvenka@linux.microsoft.com>, <gregkh@linuxfoundation.org>,
	<pbonzini@redhat.com>, <rppt@kernel.org>, <jgowans@amazon.com>,
	<arnd@arndb.de>, <keescook@chromium.org>,
	<stanislav.kinsburskii@gmail.com>, <anthony.yznaga@oracle.com>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<jamorris@linux.microsoft.com>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	kvm <kvm@vger.kernel.org>
Subject: Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
Date: Tue, 17 Oct 2023 10:31:27 +0200
Message-ID: <8f9d81a8-1071-43ca-98cd-e9c1eab8e014@amazon.de>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

Hey Madhavan!

This patch set looks super exciting - thanks a lot for putting it
together. We've been poking in a very similar direction for a while as
well and will discuss the fundamental problem of how to persist kernel
metadata across kexec at LPC:

   https://lpc.events/event/17/contributions/1485/

It would be great to have you in the room as well then.

Some more comments inline.

On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ============
>
> This feature can be used to persist kernel and user data across kexec reboots
> in RAM for various uses. E.g., persisting:
>
>          - cached data. E.g., database caches.
>          - state. E.g., KVM guest states.
>          - historical information since the last cold boot. E.g., events, logs
>            and journals.
>          - measurements for integrity checks on the next boot.
>          - driver data.
>          - IOMMU mappings.
>          - MMIO config information.
>
> This is useful on systems where there is no non-volatile storage or
> non-volatile storage is too small or too slow.

This is useful in more situations. For example, we need it to do a kexec
while a virtual machine is in a suspended state but still has its IOMMU
mappings intact (Live Update). For that, we need to ensure DMA can still
reach the VM memory and that everything gets reassembled identically and
without interruptions on the receiving end.

> The following sections describe the implementation.
>
> I have enhanced the ram disk block device driver to provide persistent ram
> disks on which any filesystem can be created. This is for persisting user data.
> I have also implemented DAX support for the persistent ram disks.

This is probably the least interesting of the enablements, right? You
can already today reserve RAM on boot as a DAX block device and use it
for that purpose.

> I am also working on making ZRAM persistent.
>
> I have also briefly discussed the following use cases:
>
>          - Persisting IOMMU mappings
>          - Remembering DMA pages
>          - Reserving pages that encounter memory errors
>          - Remembering IMA measurements for integrity checks
>          - Remembering MMIO config info
>          - Implementing prmemfs (special filesystem tailored for persistence)
>
> Allocate metadata
> =================
>
> Define a metadata structure to store all persistent memory related information.
> The metadata fits into one page. On a cold boot, allocate and initialize the
> metadata page.
>
> Allocate data
> =============
>
> On a cold boot, allocate some memory for storing persistent data. Call it
> persistent memory. Specify the size in a command line parameter:
>
>          prmem=size[KMG][,max_size[KMG]]
>
>          size            Initial amount of memory allocated to prmem during boot
>          max_size        Maximum amount of memory that can be allocated to prmem
>
> When the initial memory is exhausted via allocations, expand prmem dynamically
> up to max_size. Expansion is done by allocating from the buddy allocator.
> Record all allocations in the metadata.
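
For illustration, the quoted syntax prmem=size[KMG][,max_size[KMG]] could be parsed roughly as below. This is a userspace sketch, not code from the patch set: parse_size() and parse_prmem() are hypothetical names, parse_size() is only a stand-in for the kernel's memparse(), and treating a missing max_size as "no expansion" is my assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "<number>[K|M|G]" and report where parsing stopped via *endp. */
static uint64_t parse_size(const char *s, const char **endp)
{
	char *end;
	uint64_t val = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	*endp = end;
	return val;
}

/*
 * Parse "size[KMG][,max_size[KMG]]".
 * Returns 0 on success, -EINVAL on malformed input.
 */
static int parse_prmem(const char *arg, uint64_t *size, uint64_t *max_size)
{
	const char *p;

	assert(arg && size && max_size);

	*size = parse_size(arg, &p);
	if (p == arg || *size == 0)
		return -EINVAL;

	/* Assumption: without max_size, prmem never grows past size. */
	*max_size = *size;
	if (*p == ',') {
		const char *q = p + 1;

		*max_size = parse_size(q, &p);
		if (p == q || *max_size < *size)
			return -EINVAL;
	}
	return *p == '\0' ? 0 : -EINVAL;
}
```

With this sketch, "512M,2G" yields an initial size of 512 MiB and a ceiling of 2 GiB, while "1G,512M" is rejected because the ceiling is below the initial size.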

I don't understand why we need a separate allocator. Why can't we just 
use normal Linux allocations and serialize their location for handover? 
We would obviously still need to find a large contiguous piece of memory 
for the target kernel to bootstrap itself into until it can read which 
pages it can and cannot use, but we can do that allocation in the
source environment using CMA, no?

What I'm trying to say is: I think we're better off separating the 
handover mechanism from the allocation mechanism. If we can implement 
handover without a new allocator, we can use it for simple things with a 
slight runtime penalty. To accelerate the handover then, we can later 
add a compacting allocator that can use the handover mechanism we 
already built to persist itself.

I have a WIP branch where I'm toying with such a handover mechanism that
uses a device tree to serialize/deserialize state. By standardizing the
property naming, the receiving kernel can mark all persistent
allocations as reserved and then gradually either free them again or
mark them as in use, one by one:

https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42
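
To make the "reserve everything, then resolve one by one" idea concrete, here is a minimal userspace sketch. It is not taken from the branch above and does not touch device tree at all: struct handover_range, handover_reserve_all() and handover_resolve() are hypothetical names, and a plain array stands in for whatever the deserialized state looks like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lifecycle of a handed-over range in the receiving kernel. */
enum range_state { RANGE_RESERVED, RANGE_IN_USE, RANGE_FREE };

/* Hypothetical record for one persistent allocation. */
struct handover_range {
	const char *owner;      /* subsystem that serialized it, e.g. "ftrace" */
	uint64_t base;          /* physical start address */
	uint64_t size;          /* length in bytes */
	enum range_state state;
};

/* Early boot: keep the allocator away from everything that was handed over. */
static void handover_reserve_all(struct handover_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		r[i].state = RANGE_RESERVED;
}

/*
 * Later: a subsystem either adopts its ranges (keep != 0) or gives them
 * back (keep == 0). Returns the number of bytes resolved.
 */
static uint64_t handover_resolve(struct handover_range *r, size_t n,
				 const char *owner, int keep)
{
	uint64_t bytes = 0;

	assert(owner != NULL);

	for (size_t i = 0; i < n; i++) {
		if (r[i].state != RANGE_RESERVED || strcmp(r[i].owner, owner))
			continue;
		r[i].state = keep ? RANGE_IN_USE : RANGE_FREE;
		bytes += r[i].size;
	}
	return bytes;
}
```

The point of the two-step shape is that nothing can be overwritten before its owner has had a chance to look at it, and a subsystem that no longer understands its old state can simply release the memory.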

I used ftrace as an example payload to persist: with the handover
mechanism in place, we serialize/deserialize ftrace ring buffer metadata
and are thus able to read traces of the previous system after kexec.
This way, you can for example profile the kexec exit path.

It's not even in RFC state yet; there are a few things where I would
need a couple of days to think hard about data structures, layouts and
other problems :). But I believe you get the idea from the patch.

One user of such a kexec handover (kho) mechanism could be a new
allocator like prmem, and each subsystem's serialization code could
choose to rely on the prmem subsystem to persist data instead of doing
it itself. That way you get a very non-intrusive enablement path for
kexec handover, easily amendable data structures that can change
compatibly over time, as well as the ability to recreate ephemeral data
structures based on persistent information - which will be necessary to
persist VFIO containers.

Alex

