linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC PATCH] Introduce persistent memory pool
       [not found] ` <2023082506-enchanted-tripping-d1d5@gregkh>
@ 2023-08-23  1:36   ` Stanislav Kinsburskii
       [not found]   ` <c26ad989dcc6737dd295e980c78ef53740098810.camel@amazon.com>
       [not found]   ` <64e8f6dd.050a0220.edb3c.c045SMTPIN_ADDED_BROKEN@mx.google.com>
  2 siblings, 0 replies; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-23  1:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Fri, Aug 25, 2023 at 10:05:08AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> > This patch addresses the need for a memory allocator dedicated to
> > persistent memory within the kernel. This allocator will preserve
> > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > more across kexec.
> 
> And CMA doesn't do this for you today?
> 

Unfortunatelly no, as it doesn't preserve the bitmap (and thus the
actual pages state) across kexec.
But perhaps extending cma or building something else on top of it is the
right way forward here.
I'll look into it, thank you for the pointer.

> >acrros The proposed solution offers a foundational implementation for potential
> > custom solutions that might follow. Though the implementation is
> > intentionally kept concise and straightforward to foster discussion and
> > feedback, it's fully functional in its current state.
> > 
> > The persistent memory pool consists of a simple page allocator that
> > operates on reserved physical memory regions. It employs bitmaps to
> > allocate and free pages, with the following characteristics:
> > 
> >   1. Memory pool regions are specified using the command line option:
> > 
> >        pmpool=<base>,<size>
> > 
> >      Where <base> represents the physical memory base address and <size>
> >      indicates the memory size.
> > 
> >   2. While the pages allocation emulates the buddy allocator interface, it
> >      isn't confined to the MAX_ORDER.
> > 
> >   3. The memory pool initializes during a cold boot and is retained across
> >      kexec.
> > 
> > Potential applications include:
> > 
> >   1. Allowing various in-kernel entities to allocate persistent pages from
> >      a singular memory pool, eliminating the need for multiple region
> >      reservations.
> > 
> >   2. For in-kernel components that require the allocation address to be
> >      available on kernel kexec, this address can be exposed to user space and
> >      then passed via the command line.
> > 
> >   3. Separate subsystems or drivers can reserve their region, sharing a
> >      portion of it for their persistent memory pool. This might be a file
> >      system, a key-value store, or other applications.
> > 
> > Potential Enhancements for the Proposed Memory Pool:
> > 
> >   1. Multiple Memory Regions Support: enhance the pool to accommodate and
> >      manage multiple memory regions simultaneously.
> > 
> >   2. In-Kernel Memory Allocations for Storage Memory Regions:
> >    * Allow in-kernel memory allocations to serve as storage memory regions.
> >    * Implement explicit reservation of these designated regions during kexec.
> 
> As you have no in-kernel users of this, it's not something we can even
> consider at the moment for obvious reasons (neither would you want us
> to.)
> 

This is fair.
I didn't have any anticipation for this code to get merged.
My hope is rather gathering any feedback I can get.

> Can you make this part of a patch series that actually adds a user,
> probably more than one, so that we can see if any of this even makes
> sense?
> 

Will do.

> 
> >  drivers/misc/Kconfig   |    7 +
> >  drivers/misc/Makefile  |    1 
> >  drivers/misc/pmpool.c  |  270 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/pmpool.h |   20 ++++
> >  4 files changed, 298 insertions(+)
> >  create mode 100644 drivers/misc/pmpool.c
> >  create mode 100644 include/linux/pmpool.h
> 
> misc is not for memory pools, as this is not a driver.  please put this
> in the properly location instead of trying to hide it from the mm
> maintainers and subsystem :)
> 

This is indeed a misplacement. I had the completely oposite intention as
I do know that this is a common area for multiple contributors. That's
actually why I carefully crafted the recepients list to make sure all
pkram-involved people are there.
I added the mm list to the CC.

> > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > index cadd4a820c03..c8ef5b37ee98 100644
> > --- a/drivers/misc/Kconfig
> > +++ b/drivers/misc/Kconfig
> > @@ -562,6 +562,13 @@ config TPS6594_PFSM
> >  	  This driver can also be built as a module.  If so, the module
> >  	  will be called tps6594-pfsm.
> >  
> > +config PMPOOL
> > +	bool "Persistent memory pool support"
> > +	help
> > +	  This option adds support for a persistent memory pool feature, which
> > +	  provides pages allocation and freeing from a set of persistent memory ranges,
> > +	  deposited to the memory pool.
> 
> Why would this even be a user selectable option?  Either the kernel
> needs this or it doesn't.
> 

I don't see a default kernel need per se. Kernel drivers or file systems
may or may not use such a pool to store their state there. Therefore I
made it optional.

> 
> > +
> >  source "drivers/misc/c2port/Kconfig"
> >  source "drivers/misc/eeprom/Kconfig"
> >  source "drivers/misc/cb710/Kconfig"
> > diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> > index f2a4d1ff65d4..31dd6553057d 100644
> > --- a/drivers/misc/Makefile
> > +++ b/drivers/misc/Makefile
> > @@ -67,3 +67,4 @@ obj-$(CONFIG_TMR_MANAGER)      += xilinx_tmr_manager.o
> >  obj-$(CONFIG_TMR_INJECT)	+= xilinx_tmr_inject.o
> >  obj-$(CONFIG_TPS6594_ESM)	+= tps6594-esm.o
> >  obj-$(CONFIG_TPS6594_PFSM)	+= tps6594-pfsm.o
> > +obj-$(CONFIG_PMPOOL)		+= pmpool.o
> > diff --git a/drivers/misc/pmpool.c b/drivers/misc/pmpool.c
> > new file mode 100644
> > index 000000000000..e2c923b31b36
> > --- /dev/null
> > +++ b/drivers/misc/pmpool.c
> > @@ -0,0 +1,270 @@
> > +#include <linux/io.h>
> 
> You forgot basic copyright/license stuff, did you use checkpatch.pl on
> your file?
> 

Yeah, I used checkpatch.pl. And I deliberately omitted the license
stuff as I wanted the shorten the diff with intention to have it as an
RFC only.
I'll fix it next time.

> > +#include <linux/bitmap.h>
> > +#include <linux/memblock.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/types.h>
> > +
> > +#include <linux/pmpool.h>
> > +
> > +#define VERSION			1
> 
> In kernel code does not need versions.
> 

Could you elaborate on this? Should kernel version be used as a backward
compatitbility marker instead?

> > +#define MAX_REGIONS		14
> 
> Why 14?  Why not 24?  Or something else?
> 

That's the number of bitmaps for 8 MB memory ranges (default MAX_ORDER
allocations) fitting into the 4k page header after the pmempool header.
But yeah, it can be definitely something else.

> > +
> > +#define for_each_region(pmpool, r)					\
> > +	for (r = pmpool->hdr->regions;					\
> > +	     r - pmpool->hdr->regions < pmpool->hdr->nr_regions;	\
> > +	     r++)
> > +
> > +#define for_each_used_region(pmpool, r)					\
> > +	for_each_region((pmpool), (r))					\
> > +		if (!(r)->size_in_pfns) { continue; } else
> > +
> > +struct pmpool_region {
> > +	uint32_t		base_pfn;		/* 32 bits * 4k = up it 15TB of physical address space */
> 
> Please use proper kernel types when writing kernel code (i.e. u32, u8,
> and so on.)  uint*_t is for userspace code only.
> 

Will do.

> > --- /dev/null
> > +++ b/include/linux/pmpool.h
> > @@ -0,0 +1,20 @@
> > +#ifndef _PMPOOL_H
> > +#define _PMPOOL_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/spinlock.h>
> > +
> > +void *alloc_pm_pages(unsigned int order);
> > +void free_pm_pages(void *addr, unsigned int order);
> > +
> > +struct pmpool {
> > +	spinlock_t		lock;
> > +	struct pmpool_header	*hdr;
> > +};
> > +
> > +int pmpool_init(struct pmpool *pmpool, phys_addr_t base, phys_addr_t size);
> > +
> > +void *alloc_pmpool_pages(struct pmpool *pmpool, unsigned int order);
> > +void free_pmpool_pages(struct pmpool *pmpool, void *addr, unsigned int order);
> 
> Please use "noun_verb_*" for new apis so that we have a chance at
> understanding where the calls are living.
> 

Will do.

> good luck!
> 

Thank you, Greg.

Stanislav

> greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
       [not found]   ` <c26ad989dcc6737dd295e980c78ef53740098810.camel@amazon.com>
@ 2023-08-23  2:45     ` Stanislav Kinsburskii
  2023-08-28 20:50       ` Alexander Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-23  2:45 UTC (permalink / raw)
  To: Gowans, James
  Cc: gregkh@linuxfoundation.org, rppt@kernel.org,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, stanislav.kinsburskii@gmail.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, kys@microsoft.com, arnd@arndb.de,
	Graf (AWS), Alexander, wei.liu@kernel.org,
	anrayabh@linux.microsoft.com, dragan.cvetic@amd.com,
	jinankjain@linux.microsoft.com, derek.kiernan@amd.com,
	linux-mm@kvack.org, Andrew Morton

+akpm, +linux-mm

On Fri, Aug 25, 2023 at 01:32:40PM +0000, Gowans, James wrote:
> On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:
> 
> Thanks for adding me to this thread Greg!
> 
> > On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> > > This patch addresses the need for a memory allocator dedicated to
> > > persistent memory within the kernel. This allocator will preserve
> > > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > > more across kexec.
> > > The proposed solution offers a foundational implementation for potential
> > > custom solutions that might follow. Though the implementation is
> > > intentionally kept concise and straightforward to foster discussion and
> > > feedback, it's fully functional in its current state.
> 
> Hi Stanislav, it looks like we're working on similar things. I'm looking
> to develop a mechanism to support hypervisor live update for when KVM is
> running VMs with PCI device passthrough. VMs with device passthrough
> also necessitates passing and re-hydrating IOMMU state so that DMA can
> continue during live update.
> 
> Planning on having an LPC session on this topic:
> https://lpc.events/event/17/abstracts/1629/ (currently it's only a
> submitted abstract so not sure if visible, hopefully it will be soon).
> 
> We are looking at implementing persistence across kexec via an in-memory
> filesystem on top of reserved memory. This would have files for anything
> that needs to be persisted. That includes files for IOMMU pgtables, for
> guest memory or userspace-accessible memory. 
> 
> It may be nice to solve all kexec persistence requirements with one
> solution, but we can consider IOMMU separately. There are at least three
> ways that this can be done:
> a) carving out reserved memory for pgtables. This is done by your
> proposal here, as well as my suggestion of a filesystem.
> b) pre/post kexec hooks for drivers to serialise state and pass it
> across in a structured format from old to new kernel.
> c) Reconstructing IOMMU state in the new kernel by starting at the
> hardware registers and walking the page tables. No state passing needed.
> 
> Have you considered option (b) and (c) here? One of the implications of
> (b) and (c) are that they would need to hook into the buddy allocator
> really early to be able to carve out the reconstructed page tables
> before the allocator is used. Similar to how pkram [0] hooks in early to
> carve out pages used for its filesystem.
> 

Hi James,

We are indeed working on similar things, so thanks for chiming in.
I've seen pkram proposal as well as your comments there.

I think (b) will need some persistent-over-kexec memory to pass the
state across kexec as well as some key-value store persisted as well.
And the proposed persistent memory pool is aimed exactly for this
purpose.
Or do you imply some other way to pass driver's data accross kexec?

I dind't consider (c) yet, thanks for for the pointer.

I have a question in this scope: how is PCI devices registers state is persisted
across kexec with the files system you are working on? I.e. how does
driver know, that the device shouldn't not be reinitialized?

> > 
> > > 
> > > Potential applications include:
> > > 
> > >   1. Allowing various in-kernel entities to allocate persistent pages from
> > >      a singular memory pool, eliminating the need for multiple region
> > >      reservations.
> > > 
> > >   2. For in-kernel components that require the allocation address to be
> > >      available on kernel kexec, this address can be exposed to user space and
> > >      then passed via the command line.
> 
> Do you have specific examples of other state that needs to be passed
> across? Trying to see whether tailoring specifically to the IOMMU case
> is okay. Conceptually IOMMU state can be reconstructed starting with
> hardware registers, not needing reserved memory. Other use-cases may not
> have this option.
> 

Well, basically it's IOMMU state and PCI devices to skip/avoid
initializing.
I bet there can be other misc (and unrelated things) like persistent
filesystems, block devices, etc. But I don't have a solid set of use
cases to present.

> > 
> > As you have no in-kernel users of this, it's not something we can even
> > consider at the moment for obvious reasons (neither would you want us
> > to.)
> > 
> > Can you make this part of a patch series that actually adds a user,
> > probably more than one, so that we can see if any of this even makes
> > sense?
> 
> I'm very keen to see this as well. The way that the IOMMU drivers are
> enlightened to hook into your memory pool will likely be similar to how
> they would hook into my proposal of an in-memory filesystem.
> Do you have code available showing the IOMMU integration? 
> 

No, I don't have such a code yet.
But I was thinking that using such a allocator in the mempool allows
to hide this implementation under the hood of an existent generic
mechanism, which is then can be used to create persistent objects (file
system, for example) on top of it.

> > 
> > >  drivers/misc/Kconfig   |    7 +
> > >  drivers/misc/Makefile  |    1
> > >  drivers/misc/pmpool.c  |  270 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/pmpool.h |   20 ++++
> > >  4 files changed, 298 insertions(+)
> > >  create mode 100644 drivers/misc/pmpool.c
> > >  create mode 100644 include/linux/pmpool.h
> > 
> > misc is not for memory pools, as this is not a driver.  please put this
> > in the properly location instead of trying to hide it from the mm
> > maintainers and subsystem :)
> 
> One of the reasons I thought a proper filesystem would be a better way
> of exposing this functionality.
> 

Yes, I see the point of having a file system for the goals you are
targeting. It looks like the right way forward.

What I'm not sure about is that persistent-over-kexec memory management and
the actual preservation mechanism should be an embedded part of this file
system.

What I'm trying to propose is a part of a generic mechanism to provide
state persistence across kexec, which can then be used by your file
system or something else.

What do you think about this approach?

Thanks,
Stanislav

> JG
> 
> 
> [0]
> https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com/T/


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-26  7:45     ` Greg Kroah-Hartman
@ 2023-08-23  6:15       ` Stanislav Kinsburskii
       [not found]       ` <64ea25cd.650a0220.642cc.50e6SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-23  6:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Sat, Aug 26, 2023 at 09:45:39AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 22, 2023 at 06:36:10PM -0700, Stanislav Kinsburskii wrote:
> > > > +#include <linux/bitmap.h>
> > > > +#include <linux/memblock.h>
> > > > +#include <linux/spinlock.h>
> > > > +#include <linux/types.h>
> > > > +
> > > > +#include <linux/pmpool.h>
> > > > +
> > > > +#define VERSION			1
> > > 
> > > In kernel code does not need versions.
> > > 
> > 
> > Could you elaborate on this? Should kernel version be used as a backward
> > compatitbility marker instead?
> 
> kernel versions should never be checked for in-kernel code, so I really
> don't understand the question here sorry.
> 
> For code that is in the kernel tree, having "versions" on them (as many
> drivers used to, and now only a few do), makes no sense, especially with
> the stable/lts trees getting fixes for them over time as well.
> 

This version is rather an ABI version. The idea is to make sure, that
any future ABI change is explicit and reflected in the version, so it
can be easily noticed in case of kexec to a kernel with an older
version.
But I guess there are other ways to make sure, that the ABI contract is
the preserved.

> In short, there should not be a need for a "version" anywhere.
> 

I see. Thank you.

Stanislav

> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-26 17:02         ` Greg Kroah-Hartman
@ 2023-08-23  6:21           ` Stanislav Kinsburskii
       [not found]           ` <64ea3699.170a0220.13ee0.5c3aSMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-23  6:21 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Sat, Aug 26, 2023 at 07:02:12PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 22, 2023 at 11:15:08PM -0700, Stanislav Kinsburskii wrote:
> > On Sat, Aug 26, 2023 at 09:45:39AM +0200, Greg Kroah-Hartman wrote:
> > > On Tue, Aug 22, 2023 at 06:36:10PM -0700, Stanislav Kinsburskii wrote:
> > > > > > +#include <linux/bitmap.h>
> > > > > > +#include <linux/memblock.h>
> > > > > > +#include <linux/spinlock.h>
> > > > > > +#include <linux/types.h>
> > > > > > +
> > > > > > +#include <linux/pmpool.h>
> > > > > > +
> > > > > > +#define VERSION			1
> > > > > 
> > > > > In kernel code does not need versions.
> > > > > 
> > > > 
> > > > Could you elaborate on this? Should kernel version be used as a backward
> > > > compatitbility marker instead?
> > > 
> > > kernel versions should never be checked for in-kernel code, so I really
> > > don't understand the question here sorry.
> > > 
> > > For code that is in the kernel tree, having "versions" on them (as many
> > > drivers used to, and now only a few do), makes no sense, especially with
> > > the stable/lts trees getting fixes for them over time as well.
> > > 
> > 
> > This version is rather an ABI version. The idea is to make sure, that
> > any future ABI change is explicit and reflected in the version, so it
> > can be easily noticed in case of kexec to a kernel with an older
> > version.
> > But I guess there are other ways to make sure, that the ABI contract is
> > the preserved.
> 
> Which ABI are you referring to here.  The user/kernel one?  Or the
> kernel/hypervisor one?  Or something else?
> 

Yeah, I guess the "ABI" word in misleading here, especially the first
letter. I mean something else: the old kernel/new kernel.
This persistent memory pool (its metadata) is supposed to be passed
across kexec with the data. That is probably the main difference in
comparison to pmem or cma.
Since the header can change its format between kernels, there should be
a way to identify it.

> There is no "numbering" of user/kernel apis, sorry.  APIs just don't
> need that, you can handle things properly automatically without version
> numbers (as again, that just does not work.)
> 

This is fair.
Version looked like the cheapest way to implement the compatibility, but
there are other ways, which I'll explore as you suggested.

Thank you,

Stanislav

> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
       [not found]   ` <64e8f6dd.050a0220.edb3c.c045SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2023-08-26  7:45     ` Greg Kroah-Hartman
  2023-08-23  6:15       ` Stanislav Kinsburskii
       [not found]       ` <64ea25cd.650a0220.642cc.50e6SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Greg Kroah-Hartman @ 2023-08-26  7:45 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Tue, Aug 22, 2023 at 06:36:10PM -0700, Stanislav Kinsburskii wrote:
> > > +#include <linux/bitmap.h>
> > > +#include <linux/memblock.h>
> > > +#include <linux/spinlock.h>
> > > +#include <linux/types.h>
> > > +
> > > +#include <linux/pmpool.h>
> > > +
> > > +#define VERSION			1
> > 
> > In kernel code does not need versions.
> > 
> 
> Could you elaborate on this? Should kernel version be used as a backward
> compatitbility marker instead?

kernel versions should never be checked for in-kernel code, so I really
don't understand the question here sorry.

For code that is in the kernel tree, having "versions" on them (as many
drivers used to, and now only a few do), makes no sense, especially with
the stable/lts trees getting fixes for them over time as well.

In short, there should not be a need for a "version" anywhere.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
       [not found]       ` <64ea25cd.650a0220.642cc.50e6SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2023-08-26 17:02         ` Greg Kroah-Hartman
  2023-08-23  6:21           ` Stanislav Kinsburskii
       [not found]           ` <64ea3699.170a0220.13ee0.5c3aSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Greg Kroah-Hartman @ 2023-08-26 17:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Tue, Aug 22, 2023 at 11:15:08PM -0700, Stanislav Kinsburskii wrote:
> On Sat, Aug 26, 2023 at 09:45:39AM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 22, 2023 at 06:36:10PM -0700, Stanislav Kinsburskii wrote:
> > > > > +#include <linux/bitmap.h>
> > > > > +#include <linux/memblock.h>
> > > > > +#include <linux/spinlock.h>
> > > > > +#include <linux/types.h>
> > > > > +
> > > > > +#include <linux/pmpool.h>
> > > > > +
> > > > > +#define VERSION			1
> > > > 
> > > > In kernel code does not need versions.
> > > > 
> > > 
> > > Could you elaborate on this? Should kernel version be used as a backward
> > > compatitbility marker instead?
> > 
> > kernel versions should never be checked for in-kernel code, so I really
> > don't understand the question here sorry.
> > 
> > For code that is in the kernel tree, having "versions" on them (as many
> > drivers used to, and now only a few do), makes no sense, especially with
> > the stable/lts trees getting fixes for them over time as well.
> > 
> 
> This version is rather an ABI version. The idea is to make sure, that
> any future ABI change is explicit and reflected in the version, so it
> can be easily noticed in case of kexec to a kernel with an older
> version.
> But I guess there are other ways to make sure, that the ABI contract is
> the preserved.

Which ABI are you referring to here.  The user/kernel one?  Or the
kernel/hypervisor one?  Or something else?

There is no "numbering" of user/kernel apis, sorry.  APIs just don't
need that, you can handle things properly automatically without version
numbers (as again, that just does not work.)

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
       [not found]           ` <64ea3699.170a0220.13ee0.5c3aSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2023-08-26 20:04             ` Greg Kroah-Hartman
  2023-08-31 14:18               ` Paolo Bonzini
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2023-08-26 20:04 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On Tue, Aug 22, 2023 at 11:21:59PM -0700, Stanislav Kinsburskii wrote:
> On Sat, Aug 26, 2023 at 07:02:12PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 22, 2023 at 11:15:08PM -0700, Stanislav Kinsburskii wrote:
> > > On Sat, Aug 26, 2023 at 09:45:39AM +0200, Greg Kroah-Hartman wrote:
> > > > On Tue, Aug 22, 2023 at 06:36:10PM -0700, Stanislav Kinsburskii wrote:
> > > > > > > +#include <linux/bitmap.h>
> > > > > > > +#include <linux/memblock.h>
> > > > > > > +#include <linux/spinlock.h>
> > > > > > > +#include <linux/types.h>
> > > > > > > +
> > > > > > > +#include <linux/pmpool.h>
> > > > > > > +
> > > > > > > +#define VERSION			1
> > > > > > 
> > > > > > In kernel code does not need versions.
> > > > > > 
> > > > > 
> > > > > Could you elaborate on this? Should kernel version be used as a backward
> > > > > compatitbility marker instead?
> > > > 
> > > > kernel versions should never be checked for in-kernel code, so I really
> > > > don't understand the question here sorry.
> > > > 
> > > > For code that is in the kernel tree, having "versions" on them (as many
> > > > drivers used to, and now only a few do), makes no sense, especially with
> > > > the stable/lts trees getting fixes for them over time as well.
> > > > 
> > > 
> > > This version is rather an ABI version. The idea is to make sure, that
> > > any future ABI change is explicit and reflected in the version, so it
> > > can be easily noticed in case of kexec to a kernel with an older
> > > version.
> > > But I guess there are other ways to make sure, that the ABI contract is
> > > the preserved.
> > 
> > Which ABI are you referring to here.  The user/kernel one?  Or the
> > kernel/hypervisor one?  Or something else?
> > 
> 
> Yeah, I guess the "ABI" word in misleading here, especially the first
> letter. I mean something else: the old kernel/new kernel.
> This persistent memory pool (its metadata) is supposed to be passed
> across kexec with the data. That is probably the main difference in
> comparison to pmem or cma.
> Since the header can change its format between kernels, there should be
> a way to identify it.

Ah.  Hah, that's crazy, and it's never going to work, you need to just
test the version of the kernel that the image was created for (you have
that in the kernel already) and verify that it is the same before
loading the new one.

That way you never have to worry about any "version number", it's just
the kernel specific version number instead.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-23  2:45     ` Stanislav Kinsburskii
@ 2023-08-28 20:50       ` Alexander Graf
  2023-08-29 22:07         ` Stanislav Kinsburskii
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Graf @ 2023-08-28 20:50 UTC (permalink / raw)
  To: Stanislav Kinsburskii, Gowans, James
  Cc: gregkh@linuxfoundation.org, rppt@kernel.org,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, stanislav.kinsburskii@gmail.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, kys@microsoft.com, arnd@arndb.de,
	wei.liu@kernel.org, anrayabh@linux.microsoft.com,
	dragan.cvetic@amd.com, jinankjain@linux.microsoft.com,
	derek.kiernan@amd.com, linux-mm@kvack.org, Andrew Morton, kexec,
	iommu, kvm

+kexec, iommu, kvm

On 23.08.23 04:45, Stanislav Kinsburskii wrote:
>
> +akpm, +linux-mm
>
> On Fri, Aug 25, 2023 at 01:32:40PM +0000, Gowans, James wrote:
>> On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:
>>
>> Thanks for adding me to this thread Greg!
>>
>>> On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
>>>> This patch addresses the need for a memory allocator dedicated to
>>>> persistent memory within the kernel. This allocator will preserve
>>>> kernel-specific states like DMA passthrough device states, IOMMU state, and
>>>> more across kexec.
>>>> The proposed solution offers a foundational implementation for potential
>>>> custom solutions that might follow. Though the implementation is
>>>> intentionally kept concise and straightforward to foster discussion and
>>>> feedback, it's fully functional in its current state.
>> Hi Stanislav, it looks like we're working on similar things. I'm looking
>> to develop a mechanism to support hypervisor live update for when KVM is
>> running VMs with PCI device passthrough. VMs with device passthrough
>> also necessitates passing and re-hydrating IOMMU state so that DMA can
>> continue during live update.
>>
>> Planning on having an LPC session on this topic:
>> https://lpc.events/event/17/abstracts/1629/ (currently it's only a
>> submitted abstract so not sure if visible, hopefully it will be soon).
>>
>> We are looking at implementing persistence across kexec via an in-memory
>> filesystem on top of reserved memory. This would have files for anything
>> that needs to be persisted. That includes files for IOMMU pgtables, for
>> guest memory or userspace-accessible memory.
>>
>> It may be nice to solve all kexec persistence requirements with one
>> solution, but we can consider IOMMU separately. There are at least three
>> ways that this can be done:
>> a) carving out reserved memory for pgtables. This is done by your
>> proposal here, as well as my suggestion of a filesystem.
>> b) pre/post kexec hooks for drivers to serialise state and pass it
>> across in a structured format from old to new kernel.
>> c) Reconstructing IOMMU state in the new kernel by starting at the
>> hardware registers and walking the page tables. No state passing needed.
>>
>> Have you considered option (b) and (c) here? One of the implications of
>> (b) and (c) are that they would need to hook into the buddy allocator
>> really early to be able to carve out the reconstructed page tables
>> before the allocator is used. Similar to how pkram [0] hooks in early to
>> carve out pages used for its filesystem.
>>
> Hi James,
>
> We are indeed working on similar things, so thanks for chiming in.
> I've seen pkram proposal as well as your comments there.
>
> I think (b) will need some persistent-over-kexec memory to pass the
> state across kexec as well as some key-value store persisted as well.
> And the proposed persistent memory pool is aimed exactly for this
> purpose.
> Or do you imply some other way to pass driver's data accross kexec?


If I had to build this, I'd probably do it just like device tree passing 
on ARM. It's a single, physically contiguous blob of data whose entry 
point you pass to the target kernel. IIRC ACPI passing works similarly. 
This would just be one more opaque data structure that then needs very 
strict versioning and forward/backward compat guarantees.


> I dind't consider (c) yet, thanks for for the pointer.
>
> I have a question in this scope: how is PCI devices registers state is persisted
> across kexec with the files system you are working on? I.e. how does
> driver know, that the device shouldn't not be reinitialized?


The easiest way to do it initially would be kernel command line options 
that hack up the drivers. But I suppose depending on the option we go 
with, you can also use the respective "natural" path:

(a) A special metadata file that explains the state to the driver
(b) An entry in the structured file format that explains the state to 
the target driver
(c) Compatible target drivers try to enumerate state from the target 
device's register file


>
>>>> Potential applications include:
>>>>
>>>>    1. Allowing various in-kernel entities to allocate persistent pages from
>>>>       a singular memory pool, eliminating the need for multiple region
>>>>       reservations.
>>>>
>>>>    2. For in-kernel components that require the allocation address to be
>>>>       available on kernel kexec, this address can be exposed to user space and
>>>>       then passed via the command line.
>> Do you have specific examples of other state that needs to be passed
>> across? Trying to see whether tailoring specifically to the IOMMU case
>> is okay. Conceptually IOMMU state can be reconstructed starting with
>> hardware registers, not needing reserved memory. Other use-cases may not
>> have this option.
>>
> Well, basically it's IOMMU state and PCI devices to skip/avoid
> initializing.
> I bet there can be other misc (and unrelated things) like persistent
> filesystems, block devices, etc. But I don't have a solid set of use
> cases to present.


Would be great if you could think through the problem space until LPC so 
we can have a solid conversation there :)


>
>>> As you have no in-kernel users of this, it's not something we can even
>>> consider at the moment for obvious reasons (neither would you want us
>>> to.)
>>>
>>> Can you make this part of a patch series that actually adds a user,
>>> probably more than one, so that we can see if any of this even makes
>>> sense?
>> I'm very keen to see this as well. The way that the IOMMU drivers are
>> enlightened to hook into your memory pool will likely be similar to how
>> they would hook into my proposal of an in-memory filesystem.
>> Do you have code available showing the IOMMU integration?
>>
> No, I don't have such a code yet.
> But I was thinking that using such a allocator in the mempool allows
> to hide this implementation under the hood of an existent generic
> mechanism, which is then can be used to create persistent objects (file
> system, for example) on top of it.


Unfortunately it's practically impossible to have a solid conversation 
on generic mechanisms without actual users to see how they fit in with 
the real world. That's Greg's answer to your patch set and I tend to 
agree. What if (b) or (c) turn out much more viable? Then we've wasted a 
lot of effort in shaping up the allocator for no good reason.


>
>>>>   drivers/misc/Kconfig   |    7 +
>>>>   drivers/misc/Makefile  |    1
>>>>   drivers/misc/pmpool.c  |  270 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   include/linux/pmpool.h |   20 ++++
>>>>   4 files changed, 298 insertions(+)
>>>>   create mode 100644 drivers/misc/pmpool.c
>>>>   create mode 100644 include/linux/pmpool.h
>>> misc is not for memory pools, as this is not a driver.  please put this
>>> in the properly location instead of trying to hide it from the mm
>>> maintainers and subsystem :)
>> One of the reasons I thought a proper filesystem would be a better way
>> of exposing this functionality.
>>
> Yes, I see the point of having a file system for the goals you are
> targeting. It looks like the right way forward.
>
> What I'm not sure about is that persistent-over-kexec memory management and
> the actual preservation mechanism should be an embedded part of this file
> system.
>
> What I'm trying to propose is a part of a generic mechanism to provide
> state persistence across kexec, which can then be used by your file
> system or something else.
>
> What do you think about this approach?


IMHO we need to at least prototype each of the paths outlined above to 
be able to create a sense for what works the best. I can see problems 
with all of them - and I'm personally not convinced there will be a 
one-size-fits-all solution yet.

That said, I am happy to see you pursue similar paths to what we have in 
mind. It means there is a real gap in functionality in Linux kexec that 
we need to overcome sooner or later.


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-28 20:50       ` Alexander Graf
@ 2023-08-29 22:07         ` Stanislav Kinsburskii
  2023-08-30  7:20           ` Alexander Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-29 22:07 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Gowans, James, gregkh@linuxfoundation.org, rppt@kernel.org,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, stanislav.kinsburskii@gmail.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, kys@microsoft.com, arnd@arndb.de,
	wei.liu@kernel.org, anrayabh@linux.microsoft.com,
	dragan.cvetic@amd.com, jinankjain@linux.microsoft.com,
	derek.kiernan@amd.com, linux-mm@kvack.org, Andrew Morton, kexec,
	iommu, kvm

On Mon, Aug 28, 2023 at 10:50:19PM +0200, Alexander Graf wrote:
> +kexec, iommu, kvm
> 
> On 23.08.23 04:45, Stanislav Kinsburskii wrote:
> > 
> > +akpm, +linux-mm
> > 
> > On Fri, Aug 25, 2023 at 01:32:40PM +0000, Gowans, James wrote:
> > > On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:
> > > 
> > > Thanks for adding me to this thread Greg!
> > > 
> > > > On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> > > > > This patch addresses the need for a memory allocator dedicated to
> > > > > persistent memory within the kernel. This allocator will preserve
> > > > > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > > > > more across kexec.
> > > > > The proposed solution offers a foundational implementation for potential
> > > > > custom solutions that might follow. Though the implementation is
> > > > > intentionally kept concise and straightforward to foster discussion and
> > > > > feedback, it's fully functional in its current state.
> > > Hi Stanislav, it looks like we're working on similar things. I'm looking
> > > to develop a mechanism to support hypervisor live update for when KVM is
> > > running VMs with PCI device passthrough. VMs with device passthrough
> > > also necessitates passing and re-hydrating IOMMU state so that DMA can
> > > continue during live update.
> > > 
> > > Planning on having an LPC session on this topic:
> > > https://lpc.events/event/17/abstracts/1629/ (currently it's only a
> > > submitted abstract so not sure if visible, hopefully it will be soon).
> > > 
> > > We are looking at implementing persistence across kexec via an in-memory
> > > filesystem on top of reserved memory. This would have files for anything
> > > that needs to be persisted. That includes files for IOMMU pgtables, for
> > > guest memory or userspace-accessible memory.
> > > 
> > > It may be nice to solve all kexec persistence requirements with one
> > > solution, but we can consider IOMMU separately. There are at least three
> > > ways that this can be done:
> > > a) carving out reserved memory for pgtables. This is done by your
> > > proposal here, as well as my suggestion of a filesystem.
> > > b) pre/post kexec hooks for drivers to serialise state and pass it
> > > across in a structured format from old to new kernel.
> > > c) Reconstructing IOMMU state in the new kernel by starting at the
> > > hardware registers and walking the page tables. No state passing needed.
> > > 
> > > Have you considered option (b) and (c) here? One of the implications of
> > > (b) and (c) are that they would need to hook into the buddy allocator
> > > really early to be able to carve out the reconstructed page tables
> > > before the allocator is used. Similar to how pkram [0] hooks in early to
> > > carve out pages used for its filesystem.
> > > 
> > Hi James,
> > 
> > We are indeed working on similar things, so thanks for chiming in.
> > I've seen pkram proposal as well as your comments there.
> > 
> > I think (b) will need some persistent-over-kexec memory to pass the
> > state across kexec as well as some key-value store persisted as well.
> > And the proposed persistent memory pool is aimed exactly for this
> > purpose.
> > Or do you imply some other way to pass driver's data accross kexec?
> 
> 
> If I had to build this, I'd probably do it just like device tree passing on
> ARM. It's a single, physically contiguous blob of data whose entry point you
> pass to the target kernel. IIRC ACPI passing works similarly. This would
> just be one more opaque data structure that then needs very strict
> versioning and forward/backward compat guarantees.
> 

Device tree or ACPI are options indeed. However AFAIU in case of DT user
space has to involved into the picture to modify and complie it, while
ACPI isn't flexible or easily extendable.
Also, AFAIU both these standards were designed with passing
hardware-specific data in mind from bootstrap software to an OS kernel
and thus were never really intended to be used for creating a persistent
state accross kexec.
To me, an attempt to use either of them to pass kernel-specific data looks
like an abuse (or misuse) excused by the simplicity of implementation.

> 
> > I dind't consider (c) yet, thanks for for the pointer.
> > 
> > I have a question in this scope: how is PCI devices registers state is persisted
> > across kexec with the files system you are working on? I.e. how does
> > driver know, that the device shouldn't not be reinitialized?
> 
> 
> The easiest way to do it initially would be kernel command line options that
> hack up the drivers. But I suppose depending on the option we go with, you
> can also use the respective "natural" path:
> 
> (a) A special metadata file that explains the state to the driver
> (b) An entry in the structured file format that explains the state to the
> target driver
> (c) Compatible target drivers try to enumerate state from the target
> device's register file
> 

Command line option is the simplest way to go indeed, but from my POV
it's good only for pointing to a particualr object, which is persisted
somehow else. But it we have a persistence mechanism, then I think we
can make another step forward and don't use command line at all (which
is a bit cumbersome and errorprone due to it's human-readable and
serialized nature).

I'm leaning towards some kind of "natural" path you mentioned... I guess
I'm a bit confused with the word "file" here, as it sounds line it
implies a file system driver, and I'm not sure that's what we want for
driver specific data.

> 
> > 
> > > > > Potential applications include:
> > > > > 
> > > > >    1. Allowing various in-kernel entities to allocate persistent pages from
> > > > >       a singular memory pool, eliminating the need for multiple region
> > > > >       reservations.
> > > > > 
> > > > >    2. For in-kernel components that require the allocation address to be
> > > > >       available on kernel kexec, this address can be exposed to user space and
> > > > >       then passed via the command line.
> > > Do you have specific examples of other state that needs to be passed
> > > across? Trying to see whether tailoring specifically to the IOMMU case
> > > is okay. Conceptually IOMMU state can be reconstructed starting with
> > > hardware registers, not needing reserved memory. Other use-cases may not
> > > have this option.
> > > 
> > Well, basically it's IOMMU state and PCI devices to skip/avoid
> > initializing.
> > I bet there can be other misc (and unrelated things) like persistent
> > filesystems, block devices, etc. But I don't have a solid set of use
> > cases to present.
> 
> 
> Would be great if you could think through the problem space until LPC so we
> can have a solid conversation there :)
> 

Yeah, I have a few ideas I'll try to implement and share before LPC.
Unfortunatelly I'm not planning to attend it this year, so this
conversation will be without me.
But I'll do my best to provide as much content to discuss as I can.

> 
> > 
> > > > As you have no in-kernel users of this, it's not something we can even
> > > > consider at the moment for obvious reasons (neither would you want us
> > > > to.)
> > > > 
> > > > Can you make this part of a patch series that actually adds a user,
> > > > probably more than one, so that we can see if any of this even makes
> > > > sense?
> > > I'm very keen to see this as well. The way that the IOMMU drivers are
> > > enlightened to hook into your memory pool will likely be similar to how
> > > they would hook into my proposal of an in-memory filesystem.
> > > Do you have code available showing the IOMMU integration?
> > > 
> > No, I don't have such a code yet.
> > But I was thinking that using such a allocator in the mempool allows
> > to hide this implementation under the hood of an existent generic
> > mechanism, which is then can be used to create persistent objects (file
> > system, for example) on top of it.
> 
> 
> Unfortunately it's practically impossible to have a solid conversation on
> generic mechanisms without actual users to see how they fit in with the real
> world. That's Greg's answer to your patch set and I tend to agree. What if
> (b) or (c) turn out much more viable? Then we've wasted a lot of effort in
> shaping up the allocator for no good reason.
> 

This is fair.
I sent such a small piece, because I wanted to get some opinions about
the approach in general without too much investement into.
Thanks to you and Greg, I now have the feadeback I was looking for, so
I'm planning to come up with a series including in-kernel users, like
Greg suggested.

Thanks,
Stanislav



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-29 22:07         ` Stanislav Kinsburskii
@ 2023-08-30  7:20           ` Alexander Graf
  2023-08-30 23:39             ` Arnd Bergmann
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Graf @ 2023-08-30  7:20 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Gowans, James, gregkh@linuxfoundation.org, rppt@kernel.org,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, stanislav.kinsburskii@gmail.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, kys@microsoft.com, arnd@arndb.de,
	wei.liu@kernel.org, anrayabh@linux.microsoft.com,
	dragan.cvetic@amd.com, jinankjain@linux.microsoft.com,
	derek.kiernan@amd.com, linux-mm@kvack.org, Andrew Morton, kexec,
	iommu, kvm

Hi Stanislav,

On 30.08.23 00:07, Stanislav Kinsburskii wrote:
>
> On Mon, Aug 28, 2023 at 10:50:19PM +0200, Alexander Graf wrote:
>> +kexec, iommu, kvm
>>
>> On 23.08.23 04:45, Stanislav Kinsburskii wrote:
>>> +akpm, +linux-mm
>>>
>>> On Fri, Aug 25, 2023 at 01:32:40PM +0000, Gowans, James wrote:
>>>> On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:
>>>>
>>>> Thanks for adding me to this thread Greg!
>>>>
>>>>> On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
>>>>>> This patch addresses the need for a memory allocator dedicated to
>>>>>> persistent memory within the kernel. This allocator will preserve
>>>>>> kernel-specific states like DMA passthrough device states, IOMMU state, and
>>>>>> more across kexec.
>>>>>> The proposed solution offers a foundational implementation for potential
>>>>>> custom solutions that might follow. Though the implementation is
>>>>>> intentionally kept concise and straightforward to foster discussion and
>>>>>> feedback, it's fully functional in its current state.
>>>> Hi Stanislav, it looks like we're working on similar things. I'm looking
>>>> to develop a mechanism to support hypervisor live update for when KVM is
>>>> running VMs with PCI device passthrough. VMs with device passthrough
>>>> also necessitates passing and re-hydrating IOMMU state so that DMA can
>>>> continue during live update.
>>>>
>>>> Planning on having an LPC session on this topic:
>>>> https://lpc.events/event/17/abstracts/1629/ (currently it's only a
>>>> submitted abstract so not sure if visible, hopefully it will be soon).
>>>>
>>>> We are looking at implementing persistence across kexec via an in-memory
>>>> filesystem on top of reserved memory. This would have files for anything
>>>> that needs to be persisted. That includes files for IOMMU pgtables, for
>>>> guest memory or userspace-accessible memory.
>>>>
>>>> It may be nice to solve all kexec persistence requirements with one
>>>> solution, but we can consider IOMMU separately. There are at least three
>>>> ways that this can be done:
>>>> a) carving out reserved memory for pgtables. This is done by your
>>>> proposal here, as well as my suggestion of a filesystem.
>>>> b) pre/post kexec hooks for drivers to serialise state and pass it
>>>> across in a structured format from old to new kernel.
>>>> c) Reconstructing IOMMU state in the new kernel by starting at the
>>>> hardware registers and walking the page tables. No state passing needed.
>>>>
>>>> Have you considered option (b) and (c) here? One of the implications of
>>>> (b) and (c) are that they would need to hook into the buddy allocator
>>>> really early to be able to carve out the reconstructed page tables
>>>> before the allocator is used. Similar to how pkram [0] hooks in early to
>>>> carve out pages used for its filesystem.
>>>>
>>> Hi James,
>>>
>>> We are indeed working on similar things, so thanks for chiming in.
>>> I've seen pkram proposal as well as your comments there.
>>>
>>> I think (b) will need some persistent-over-kexec memory to pass the
>>> state across kexec as well as some key-value store persisted as well.
>>> And the proposed persistent memory pool is aimed exactly for this
>>> purpose.
>>> Or do you imply some other way to pass driver's data accross kexec?
>>
>> If I had to build this, I'd probably do it just like device tree passing on
>> ARM. It's a single, physically contiguous blob of data whose entry point you
>> pass to the target kernel. IIRC ACPI passing works similarly. This would
>> just be one more opaque data structure that then needs very strict
>> versioning and forward/backward compat guarantees.
>>
> Device tree or ACPI are options indeed. However AFAIU in case of DT user
> space has to involved into the picture to modify and complie it, while
> ACPI isn't flexible or easily extendable.
> Also, AFAIU both these standards were designed with passing
> hardware-specific data in mind from bootstrap software to an OS kernel
> and thus were never really intended to be used for creating a persistent
> state accross kexec.
> To me, an attempt to use either of them to pass kernel-specific data looks
> like an abuse (or misuse) excused by the simplicity of implementation.


What I was describing above is that the Linux boot protocol already has 
natural ways to pass a DT (arm) or set of ACPI tables (x86) to the 
target kernel. Whatever we do here should either piggy back on top of 
those natural mechanisms (e.g. /chosen node in DT) or be on the same 
level (e.g. pass DT in one register, pass metadata structure in another 
register).

When it comes to the actual content of the metadata, I'm personally also 
leaning towards DT. We already have libfdt inside the kernel. It gives 
is a very simple, well understood structured file format that you can 
extend, version, etc etc. And the kernel has mechanisms to modify fdt 
contents.


>
>>> I dind't consider (c) yet, thanks for for the pointer.
>>>
>>> I have a question in this scope: how is PCI devices registers state is persisted
>>> across kexec with the files system you are working on? I.e. how does
>>> driver know, that the device shouldn't not be reinitialized?
>>
>> The easiest way to do it initially would be kernel command line options that
>> hack up the drivers. But I suppose depending on the option we go with, you
>> can also use the respective "natural" path:
>>
>> (a) A special metadata file that explains the state to the driver
>> (b) An entry in the structured file format that explains the state to the
>> target driver
>> (c) Compatible target drivers try to enumerate state from the target
>> device's register file
>>
> Command line option is the simplest way to go indeed, but from my POV
> it's good only for pointing to a particualr object, which is persisted
> somehow else. But it we have a persistence mechanism, then I think we
> can make another step forward and don't use command line at all (which
> is a bit cumbersome and errorprone due to it's human-readable and
> serialized nature).
>
> I'm leaning towards some kind of "natural" path you mentioned... I guess
> I'm a bit confused with the word "file" here, as it sounds line it
> implies a file system driver, and I'm not sure that's what we want for
> driver specific data.


Oh, I was just referring to the set of registers a device exposes :).


>
>>>>>> Potential applications include:
>>>>>>
>>>>>>     1. Allowing various in-kernel entities to allocate persistent pages from
>>>>>>        a singular memory pool, eliminating the need for multiple region
>>>>>>        reservations.
>>>>>>
>>>>>>     2. For in-kernel components that require the allocation address to be
>>>>>>        available on kernel kexec, this address can be exposed to user space and
>>>>>>        then passed via the command line.
>>>> Do you have specific examples of other state that needs to be passed
>>>> across? Trying to see whether tailoring specifically to the IOMMU case
>>>> is okay. Conceptually IOMMU state can be reconstructed starting with
>>>> hardware registers, not needing reserved memory. Other use-cases may not
>>>> have this option.
>>>>
>>> Well, basically it's IOMMU state and PCI devices to skip/avoid
>>> initializing.
>>> I bet there can be other misc (and unrelated things) like persistent
>>> filesystems, block devices, etc. But I don't have a solid set of use
>>> cases to present.
>>
>> Would be great if you could think through the problem space until LPC so we
>> can have a solid conversation there :)
>>
> Yeah, I have a few ideas I'll try to implement and share before LPC.
> Unfortunatelly I'm not planning to attend it this year, so this
> conversation will be without me.
> But I'll do my best to provide as much content to discuss as I can.


Thanks, looking forward to it! :)


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-30  7:20           ` Alexander Graf
@ 2023-08-30 23:39             ` Arnd Bergmann
  2023-08-31  2:24               ` Stanislav Kinsburskii
  0 siblings, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2023-08-30 23:39 UTC (permalink / raw)
  To: Alexander Graf, Stanislav Kinsburskii
  Cc: Gowans, James, Greg Kroah-Hartman, Mike Rapoport,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, Stanislav Kinsburskii,
	linux-kernel@vger.kernel.org, Sean Christopherson, Paolo Bonzini,
	K. Y. Srinivasan, Wei Liu, anrayabh@linux.microsoft.com,
	dragan.cvetic@amd.com, jinankjain@linux.microsoft.com,
	derek.kiernan@amd.com, linux-mm@kvack.org, Andrew Morton, kexec,
	iommu, kvm

On Wed, Aug 30, 2023, at 03:20, Alexander Graf wrote:
> On 30.08.23 00:07, Stanislav Kinsburskii wrote:
>> On Mon, Aug 28, 2023 at 10:50:19PM +0200, Alexander Graf wrote:

>> Device tree or ACPI are options indeed. However AFAIU in case of DT user
>> space has to involved into the picture to modify and complie it, while
>> ACPI isn't flexible or easily extendable.
>> Also, AFAIU both these standards were designed with passing
>> hardware-specific data in mind from bootstrap software to an OS kernel
>> and thus were never really intended to be used for creating a persistent
>> state accross kexec.
>> To me, an attempt to use either of them to pass kernel-specific data looks
>> like an abuse (or misuse) excused by the simplicity of implementation.
>
>
> What I was describing above is that the Linux boot protocol already has 
> natural ways to pass a DT (arm) or set of ACPI tables (x86) to the 
> target kernel. Whatever we do here should either piggy back on top of 
> those natural mechanisms (e.g. /chosen node in DT) or be on the same 
> level (e.g. pass DT in one register, pass metadata structure in another 
> register).
>
> When it comes to the actual content of the metadata, I'm personally also 
> leaning towards DT. We already have libfdt inside the kernel. It gives 
> is a very simple, well understood structured file format that you can 
> extend, version, etc etc. And the kernel has mechanisms to modify fdt 
> contents.

Agreed. This also makes a lot of sense since the fdt format was
originally introduced for this exact purpose, to be a key-value
store to pass data from the running kernel to the next one after
kexec when the original source of the data (originally open
firmware) is gone. It only turned into the generic way to
describe embedded systems later on, but both the fdt binary
format and the kexec infrastructure for manipulating and
passing the blob should be easy to reuse for additional purposes
as long as the contents are put into appropriate namespaces that
don't clash with existing usage.

       Arnd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-30 23:39             ` Arnd Bergmann
@ 2023-08-31  2:24               ` Stanislav Kinsburskii
  0 siblings, 0 replies; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-31  2:24 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Graf, Gowans, James, Greg Kroah-Hartman, Mike Rapoport,
	madvenka@linux.microsoft.com, anthony.yznaga@oracle.com,
	steven.sistare@oracle.com, Stanislav Kinsburskii,
	linux-kernel@vger.kernel.org, Sean Christopherson, Paolo Bonzini,
	K. Y. Srinivasan, Wei Liu, anrayabh@linux.microsoft.com,
	dragan.cvetic@amd.com, jinankjain@linux.microsoft.com,
	derek.kiernan@amd.com, linux-mm@kvack.org, Andrew Morton, kexec,
	iommu, kvm

On Wed, Aug 30, 2023 at 07:39:18PM -0400, Arnd Bergmann wrote:
> On Wed, Aug 30, 2023, at 03:20, Alexander Graf wrote:
> > On 30.08.23 00:07, Stanislav Kinsburskii wrote:
> >> On Mon, Aug 28, 2023 at 10:50:19PM +0200, Alexander Graf wrote:
> 
> >> Device tree or ACPI are options indeed. However AFAIU in case of DT user
> >> space has to involved into the picture to modify and complie it, while
> >> ACPI isn't flexible or easily extendable.
> >> Also, AFAIU both these standards were designed with passing
> >> hardware-specific data in mind from bootstrap software to an OS kernel
> >> and thus were never really intended to be used for creating a persistent
> >> state accross kexec.
> >> To me, an attempt to use either of them to pass kernel-specific data looks
> >> like an abuse (or misuse) excused by the simplicity of implementation.
> >
> >
> > What I was describing above is that the Linux boot protocol already has 
> > natural ways to pass a DT (arm) or set of ACPI tables (x86) to the 
> > target kernel. Whatever we do here should either piggy back on top of 
> > those natural mechanisms (e.g. /chosen node in DT) or be on the same 
> > level (e.g. pass DT in one register, pass metadata structure in another 
> > register).
> >
> > When it comes to the actual content of the metadata, I'm personally also 
> > leaning towards DT. We already have libfdt inside the kernel. It gives 
> > is a very simple, well understood structured file format that you can 
> > extend, version, etc etc. And the kernel has mechanisms to modify fdt 
> > contents.
> 
> Agreed. This also makes a lot of sense since the fdt format was
> originally introduced for this exact purpose, to be a key-value
> store to pass data from the running kernel to the next one after
> kexec when the original source of the data (originally open
> firmware) is gone. It only turned into the generic way to
> describe embedded systems later on, but both the fdt binary
> format and the kexec infrastructure for manipulating and
> passing the blob should be easy to reuse for additional purposes
> as long as the contents are put into appropriate namespaces that
> don't clash with existing usage.
> 

I see. Using DT instead of introducing another format to pass metadata
is quite tempting.
The only problem I see here is that it enforces a user of x86 host to
use device tree.
I guess such a device tree can be fully constructred in kernel and then
added to kexec parameters by kernel itself thus keeping the such
matadata management independent from user setup.
But then how to combine this in-kernel device tree with the one,
provided by user?
Also, AFAIK user can override device tree via kexec paramaters, so I
guess placing kernel-specific data into a DT, controlled by user, isn't
very reliable.
Is there a way to pass multiple device trees over kexec?
Or should there be an implicit extention to patch user provided DT
with kernel metadata before kexec?

Thanks,
Stanislav

>        Arnd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-31 14:18               ` Paolo Bonzini
@ 2023-08-31  2:37                 ` Stanislav Kinsburskii
  0 siblings, 0 replies; 14+ messages in thread
From: Stanislav Kinsburskii @ 2023-08-31  2:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Greg Kroah-Hartman, Stanislav Kinsburskii, Derek Kiernan,
	Dragan Cvetic, Arnd Bergmann, Wei Liu, K. Y. Srinivasan,
	Madhavan Venkataraman, Anthony Yznaga, Mike Rapoport (IBM),
	James Gowans, Anirudh Rayabharam, Jinank Jain, Andrew Morton,
	linux-mm, linux-kernel

On Thu, Aug 31, 2023 at 04:18:48PM +0200, Paolo Bonzini wrote:
> On 8/26/23 22:04, Greg Kroah-Hartman wrote:
> > > Yeah, I guess the "ABI" word in misleading here, especially the first
> > > letter. I mean something else: the old kernel/new kernel.
> > > This persistent memory pool (its metadata) is supposed to be passed
> > > across kexec with the data. That is probably the main difference in
> > > comparison to pmem or cma.
> > > Since the header can change its format between kernels, there should be
> > > a way to identify it.
> > 
> > Ah.  Hah, that's crazy, and it's never going to work, you need to just
> > test the version of the kernel that the image was created for (you have
> > that in the kernel already) and verify that it is the same before
> > loading the new one.
> > 
> > That way you never have to worry about any "version number", it's just
> > the kernel specific version number instead.
> 
> Checking the version of the kernel is not enough because you want to support
> kexec to a newer kernel.
> 
> I agree though that a version number is not needed.  In the end this is just
> like a filesystem and you'd better keep it backwards compatible. If you
> think you might need an extra field in the header, you have to leave some
> padding and add a "flags" field that right now is always zero.  Or something
> like that.
> 

Thanks for laying it out. I was also thinking about some pads to reserve
in metadata upfront to allow future extensions.
But I guess if we decide to store metadata in device tree, then this
padding won't be required.

Thanks,
Stanislav

> Paolo


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Introduce persistent memory pool
  2023-08-26 20:04             ` Greg Kroah-Hartman
@ 2023-08-31 14:18               ` Paolo Bonzini
  2023-08-31  2:37                 ` Stanislav Kinsburskii
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2023-08-31 14:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Stanislav Kinsburskii
  Cc: Stanislav Kinsburskii, Derek Kiernan, Dragan Cvetic,
	Arnd Bergmann, Wei Liu, K. Y. Srinivasan, Madhavan Venkataraman,
	Anthony Yznaga, Mike Rapoport (IBM), James Gowans,
	Anirudh Rayabharam, Jinank Jain, Andrew Morton, linux-mm,
	linux-kernel

On 8/26/23 22:04, Greg Kroah-Hartman wrote:
>> Yeah, I guess the "ABI" word in misleading here, especially the first
>> letter. I mean something else: the old kernel/new kernel.
>> This persistent memory pool (its metadata) is supposed to be passed
>> across kexec with the data. That is probably the main difference in
>> comparison to pmem or cma.
>> Since the header can change its format between kernels, there should be
>> a way to identify it.
> 
> Ah.  Hah, that's crazy, and it's never going to work, you need to just
> test the version of the kernel that the image was created for (you have
> that in the kernel already) and verify that it is the same before
> loading the new one.
> 
> That way you never have to worry about any "version number", it's just
> the kernel specific version number instead.

Checking the version of the kernel is not enough because you want to 
support kexec to a newer kernel.

I agree though that a version number is not needed.  In the end this is 
just like a filesystem and you'd better keep it backwards compatible. 
If you think you might need an extra field in the header, you have to 
leave some padding and add a "flags" field that right now is always 
zero.  Or something like that.

Paolo



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-08-31 17:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <64e7cbf7.050a0220.114c7.b70dSMTPIN_ADDED_BROKEN@mx.google.com>
     [not found] ` <2023082506-enchanted-tripping-d1d5@gregkh>
2023-08-23  1:36   ` [RFC PATCH] Introduce persistent memory pool Stanislav Kinsburskii
     [not found]   ` <c26ad989dcc6737dd295e980c78ef53740098810.camel@amazon.com>
2023-08-23  2:45     ` Stanislav Kinsburskii
2023-08-28 20:50       ` Alexander Graf
2023-08-29 22:07         ` Stanislav Kinsburskii
2023-08-30  7:20           ` Alexander Graf
2023-08-30 23:39             ` Arnd Bergmann
2023-08-31  2:24               ` Stanislav Kinsburskii
     [not found]   ` <64e8f6dd.050a0220.edb3c.c045SMTPIN_ADDED_BROKEN@mx.google.com>
2023-08-26  7:45     ` Greg Kroah-Hartman
2023-08-23  6:15       ` Stanislav Kinsburskii
     [not found]       ` <64ea25cd.650a0220.642cc.50e6SMTPIN_ADDED_BROKEN@mx.google.com>
2023-08-26 17:02         ` Greg Kroah-Hartman
2023-08-23  6:21           ` Stanislav Kinsburskii
     [not found]           ` <64ea3699.170a0220.13ee0.5c3aSMTPIN_ADDED_BROKEN@mx.google.com>
2023-08-26 20:04             ` Greg Kroah-Hartman
2023-08-31 14:18               ` Paolo Bonzini
2023-08-31  2:37                 ` Stanislav Kinsburskii

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).