* [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
@ 2024-08-15 16:22 Jonathan Cameron via
2024-08-16 7:05 ` Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Jonathan Cameron via @ 2024-08-15 16:22 UTC (permalink / raw)
To: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, John Groves, virtualization
Cc: Oscar Salvador, qemu-devel, Dave Jiang, Dan Williams, linuxarm,
wangkefeng.wang, John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
Introduction
============
If we think application specific memory (including inter-host shared memory) is
a thing, it will also be a thing people want to use with virtual machines,
potentially nested. So how do we present it at the Host to VM boundary?
This RFC is perhaps premature given we haven't yet merged upstream support for
the bare metal case. However I'd like to get the discussion going given we've
touched briefly on this in a number of CXL sync calls and it is clear no one is
entirely sure what direction makes sense. We may briefly touch on this in the
LPC CXL uconf, but time will be very limited.
Aim here isn't to promote a particular path, but just to describe the problem
and some potential solutions. It may be obvious which one I think is easiest,
but it may be a case of I have that hammer so will hit things with it.
It's also the case that we may not converge on a single solution and end up
with several supported. That's not a problem as long as there isn't
significant extra maintenance burden etc. There are subtle differences
between likely deployments that may make certain solutions more attractive
than others.
Whilst I'm very familiar with the bare metal CXL part of this, I'm less
familiar with the Virtual Machine and MM elements. Hence I'm hoping to
get inputs from David Hildenbrand, particularly around virtio-mem as an
option and many others to help fill in some of the gaps in information.
I'd also like inputs from those like John Groves who are looking at inter-host
sharing. I've also cc'd the QEMU list given all these solutions are likely
to involve some additional emulation work and QEMU is my preferred choice for
a reference implementation.
I've almost certainly forgotten someone, so please do +CC others.
Background
==========
Skip if you already know all about CXL or similar memory pooling technologies.
I've skipped over many of the details, because they hopefully don't matter for
the core of the questions posed. I'm happy to provide more details though if
this isn't detailed enough.
Memory pool devices
-------------------
CXL and similar technologies bring the option of having an 'appliance' that
provides disaggregated memory to a number of servers with moderately low latency
overhead compared to local memory. Typically these are multi-head devices
directly connected to Root Ports of a number of different hosts. This design
avoids the latency cost of a switched fabric.
Experimental deployments suggest ratios of around 1 memory pool to 16 CPU
sockets.
[Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, Li et al.,
ASPLOS '23]
In some deployments, each socket has its own connection, allowing lowish
latency (perhaps 1.5x typical inter-socket), highish bandwidth memory
expansion. Interleave can further boost the host to appliance bandwidth
at the cost of reducing number of hosts sharing a single appliance.
__________________ __________________ __________________
| Host A | | Host B | | Host C |
| | | | | |
| ____ ____ | | ____ ____ | | ____ ____ |
|__|_RP_|__|_RP_|__| |__|_RP_|__|_RP_|__| |__|_RP_|__|_RP_|__|
|| || || || || ||
|| || || || || ||
____||______||___________||______||___________||______||_____
| |
| |
| Memory Appliance. |
| |
|_____________________________________________________________|
CXL memory pooling options
--------------------------
CXL 2.0 provided basic memory pool facilities via hot-plug of entire devices.
This is both expensive to do and inflexible, so not well suited to memory
appliance applications.
CXL 3.0 and onwards introduced Dynamic Capacity in what is known as a Dynamic
Capacity Device (DCD).
We'll need a few key terms:
Host Physical Address (HPA). Region of the host system address map where reads
and writes will be routed to a particular CXL host bridge. This is considered
a fixed mapping (may be changeable in BIOS) and presented to the host OS via
an ACPI table. These windows are called CXL Fixed Memory Windows (CFMWs)
Yes I'm being lazy here and HPA may not quite be the same as the view a CPU sees
but that's a detail we don't care about here.
Device Physical Address (DPA). Confusingly this isn't necessarily the
addressing used on a device to access a particular physical address in the DRAM
chips, but rather a presentation of the device memory to a particular host.
There may be another level of translation underneath (this detail will matter
later)
Host Managed Device Memory Decoders (HDM Decoders). Programmable Address
Routers that control the routing of a CXL transaction.
Extents - Contiguous regions of DPA space (offset + size)
Key elements of DCD usage
-------------------------
Device to host address routing is not often changed. Typically it is set up at
boot, either in host firmware, or once the operating system has started. That
is, we'll probably program all the HDM Decoders once per boot. They may be left
in a state where the host can reprogram them, or locked down.
Regions of the DPA space that these decoders are routing the accesses to may
not be backed by anything. A slight simplification is that these unbacked
addresses read zero, and writes are dropped.
Later on, some magic entity - let's call it an orchestrator - will tell the
memory pool to provide memory to a given host. The host gets notified by
the device of an 'offer' of specific memory extents and can accept it, after
which it may start to make use of the provided memory extent.
Those address ranges may be shared across multiple hosts (in which case they
are not for general use), or may be dedicated memory intended for use as
normal RAM.
Whilst the granularity of DCD extents is allowed by the specification to be very
fine (64 Bytes), in reality my expectation is no one will build general purpose
memory pool devices with fine granularity.
Memory hot-plug options (bare metal)
------------------------------------
By default, these extents will surface as either:
1) Normal memory hot-plugged into a NUMA node.
2) DAX - requiring applications to map that memory directly or use
a filesystem etc.
There are various ways to apply policy to this. One is to base the policy
decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
is metadata that originates at the orchestrator. It's big enough to hold a
UUID, so can convey whatever meaning is agreed by the orchestrator and the
software running on each host.
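For a feel of what this means with today's tooling: once capacity has surfaced
as a devdax device, the choice between options 1 and 2 above can be made with
daxctl (standard ndctl/daxctl commands; a tag-based policy would presumably
hook in at a similar point, but nothing tag-aware exists yet):

  # List device DAX instances (a tag would ideally be reported here too).
  daxctl list -D

  # Option 1: hand the capacity to the kernel as normal (hotplugged) RAM.
  daxctl reconfigure-device --mode=system-ram dax0.0

  # Option 2: keep it as device DAX for direct mmap() by an application.
  daxctl reconfigure-device --mode=devdax dax0.0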
Memory pools tend to want to guarantee, when the circumstances change
(workload finishes etc), they can have the resources they allocated back.
CXL brings polite ways of asking for the memory back and big hammers for
when the host ignores things (which may well crash a naughty host).
Reliable hot-unplug of 'normal' memory continues to be a challenge because not
all of its use / lifetime is tied to a particular application.
Application specific memory
---------------------------
The DAX path enables association of the memory with a single application
by allowing that application to simply mmap the appropriate /dev/daxX.Y
That device optionally has an associated tag.
When the application closes or otherwise releases that memory we can
guarantee to be able to recover the capacity. Memory provided to an
application this way will be referred to here as Application Specific Memory.
This model also works for HBM or other 'better' memory that is reserved for
specific use cases.
So the flow is something like:
1. Cloud orchestrator decides it's going to run in-memory database A
on host W.
2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
UUID / tag wwwwxxxxzzzz
3. Host W accepts that memory (why would it say no?) and creates a
DAX device for which the tag is discoverable.
4. Orchestrator tells host W to launch the workload and that it
should use the memory provided with tag wwwwxxxxzzzz.
5. Host launches DB and tells it to use DAX device with tag wwwwxxxxzzzz
which the DB then mmap()s and loads its database data into.
... sometime later....
6. Orchestrator tells host W to close that DB and release the memory
allocated from the pool.
7. Host gives the memory back to the memory appliance which can then use
it to provide another host with the necessary memory.
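As a concrete illustration of step 5, here is a minimal sketch of what the
database side might look like. Note that the per-device 'tag' sysfs attribute
used for the lookup is an assumption - exactly how the tag gets exposed (here,
or somewhere like /sys/bus/cxl) is one of the open questions.

  /* Hypothetical: find a devdax device by tag and mmap it.
   * The 'tag' attribute does not exist in today's kernel ABI. */
  #include <dirent.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static int open_dax_by_tag(const char *tag)
  {
          DIR *d = opendir("/sys/bus/dax/devices");
          struct dirent *de;
          char path[256], buf[64];

          if (!d)
                  return -1;
          while ((de = readdir(d)) != NULL) {
                  FILE *f;

                  if (de->d_name[0] == '.')
                          continue;
                  /* Assumed ABI: a 'tag' attribute per dax device. */
                  snprintf(path, sizeof(path),
                           "/sys/bus/dax/devices/%s/tag", de->d_name);
                  f = fopen(path, "r");
                  if (!f)
                          continue;
                  if (fgets(buf, sizeof(buf), f) &&
                      !strncmp(buf, tag, strlen(tag))) {
                          fclose(f);
                          closedir(d);
                          snprintf(path, sizeof(path), "/dev/%s", de->d_name);
                          return open(path, O_RDWR);
                  }
                  fclose(f);
          }
          closedir(d);
          return -1;
  }

  int main(void)
  {
          size_t sz = 1UL << 30; /* must respect devdax alignment / size */
          int fd = open_dax_by_tag("wwwwxxxxzzzz");
          void *p;

          if (fd < 0)
                  return 1;
          p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;
          /* ... load the database into p ... */
          munmap(p, sz);
          close(fd);
          return 0;
  }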
This approach requires applications or at least memory allocation libraries to
be modified. The guarantees of getting the memory they asked for + that they
will definitely be able to safely give the memory back when done, may make such
software modifications worthwhile.
There are disadvantages and bloat issues if the 'wrong' amount of memory is
allocated to the application. So these techniques only work when the
orchestrator has the necessary information about the workload.
Note that one specific example of application specific memory is virtual
machines. Just in this case the virtual machine is the application.
Later on it may be useful to consider the example of the specific
application in a VM being a nested virtual machine.
Shared Memory - closely related!
--------------------------------
CXL enables a number of different types of memory sharing across multiple
hosts:
- Read only shared memory (suitable for apache arrow for example)
- Hardware Coherent shared memory.
- Software managed coherency.
These surface using the same machinery as non-shared DCD extents. Note however
that the presentation, in terms of extents, to different hosts is not the same
(they can be different extents, in an unrelated order), but along with tags,
shared extents carry sufficient data to 'construct' a virtual address to HPA
mapping that makes them look the same to aware applications or file systems.
The currently proposed approach is to surface the extents via DAX and apply a
filesystem approach to managing the data.
https://lpc.events/event/18/contributions/1827/
These two types of memory pooling activity (shared memory, application specific
memory) both require capacity associated with a tag to be presented to specific
users in a fashion that is 'separate' from normal memory hot-plug.
The virtualization question
===========================
Having made the assumption that the models above are going to be used in
practice, and that Linux will support them, the natural next step is to
assume that applications designed against them are going to be used in virtual
machines as well as on bare metal hosts.
The open question this RFC is aiming to start discussion around is how best to
present them to the VM. I want to get that discussion going early because
some of the options I can see will require specification additions and / or
significant PoC / development work to prove them out. Before we go there,
let us briefly consider other uses of pooled memory in VMs and how they
aren't really relevant here.
Other virtualization uses of memory pool capacity
-------------------------------------------------
1. Part of static capacity of VM provided from a memory pool.
Can be presented as a NUMA setup, with HMAT etc providing performance data
relative to other memory the VM is using. Recovery of pooled capacity
requires shutting down or migrating the VM.
2. Coarse grained memory increases for 'normal' memory.
Can use memory hot-plug. Recovery of capacity likely to only be possible on
VM shutdown.
Both these use cases are well covered by existing solutions so we can ignore
them for the rest of this document.
Application specific or shared dynamic capacity - VM options.
-------------------------------------------------------------
1. Memory hot-plug - but with specific purpose memory flag set in EFI
memory map. Current default policy is to bring those up as normal memory.
That policy can be adjusted via kernel option or Kconfig so they turn up
as DAX. We 'could' augment such hot-plugged memory with metadata carrying
the UUID / tag from an underlying bare metal DAX device.
2. Virtio-mem - It may be possible to fit this use case within an extended
virtio-mem.
3. Emulate a CXL type 3 device.
4. Other options?
Memory hotplug
--------------
This is the heavyweight solution but should 'work' if we close a specification
gap. Granularity limitations are unlikely to be a big problem given anticipated
CXL devices.
Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific Purpose
Memory" intended to notify the operating system that it can use the memory as
normal, but it is there for a specific use case and so might be wanted back at
any point. This memory attribute can be provided in the memory map at boot
time and if associated with EfiReservedMemoryType can be used to indicate a
range of HPA Space where memory that is hot-plugged later should be treated as
'special'.
There isn't an obvious path to associate a particular range of hot plugged
memory with a UUID / tag. I'd expect we'd need to add something to the ACPI
specification to enable this.
Virtio-mem
----------
The design goals of virtio-mem [1] mean that it is not 'directly' applicable
to this case, but it could perhaps be adapted with the addition of metadata
and DAX + guaranteed removal of explicit extents.
[1] [virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
Martin Schulz, VEE '21]
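For context, current virtio-mem usage in QEMU looks something like the
following sketch of existing functionality; none of the tag / DAX semantics
discussed here exist for it yet.

  qemu-system-x86_64 -M q35 -m 4G,maxmem=36G -smp 4 \
    -object memory-backend-ram,id=vmem0,size=32G \
    -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=0

  # Capacity is added / removed at runtime by resizing the device:
  # (QEMU monitor) qom-set vm0 requested-size 16G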
Emulating a CXL Type 3 Device
-----------------------------
Concerns raised about just emulating a CXL topology:
* A CXL Type 3 device is pretty complex.
* All we need is a tag + make it DAX, so surely this is too much?
Possible advantages
* Kernel is exactly the same as that running on the host. No new drivers or
changes to existing drivers needed as what we are presenting is a possible
device topology - which may be much simpler than the host's.
Complexity:
***********
We don't emulate everything that can exist in physical topologies.
- One emulated device per host CXL Fixed Memory Window
(I think we can't quite get away with just one in total due to BW/Latency
discovery)
- Direct connect each emulated device to an emulated RP + Host Bridge.
- Single CXL Fixed Memory Window. Never present interleave (that's a host
only problem).
- Can probably always present a single extent per DAX region (if we don't
mind burning some GPA space to avoid fragmentation).
In most real deployments, that's 1 CFMW, 1 pass through expander bridge, 1 RP
and 1 EP. We would probably lock down the decoders before presentation to the
kernel. Locking down routing is already supported by Linux as a BIOS may do
this. That lock down simplifies the emulation.
We already have most of what is needed emulated and upstream in QEMU with
the exception of a necessary optimization to avoid interleave decoding
(not relevant here, that is all for testing topology handling). PoC level
code exists for that bit. The other aspect not yet enabled, is hotplugging
additional memory backends into a single CXL Type 3 emulated device. I don't
anticipate that being a problem, but PoC needed to be sure.
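For illustration, the minimal topology described above maps onto existing QEMU
CXL emulation roughly as follows. This is a sketch shown with a static
volatile memdev; a DCD would instead use the dynamic-capacity properties and
QMP commands from Fan's DCD series, whose exact names I haven't reproduced
here.

  qemu-system-x86_64 -M q35,cxl=on -m 4G \
    -object memory-backend-ram,id=cxl-mem0,size=4G \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem-dev0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G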
One possible corner is that the Dynamic Capacity Flows in a physical machine
require flushing caches due to changes of the physical address map. Care may
be needed to silently drop such flushes if they are issued from the guest as it
will not actually be changing the physical address map when capacity is added
or released.
Today, Linux associates a single NUMA node with a CXL Fixed Memory
window. Whilst this is a limitation of the Linux handling, to avoid
major changes to that infrastructure it may make sense to present multiple
CXL Fixed Memory windows, so that the Guest can have separate NUMA nodes
for memory pools with different characteristics.
So I agree the complexity of this solution is a valid point, but it is mostly
emulation complexity. As emulated devices go it's not that complex (and
we have most of it in place already and upstream in QEMU with Fan's
DCD emulation support going in recently).
Error handling:
***************
What we mostly care about here is memory corruption. Protocol errors
may be relevant if we can contain the resulting resets, but that is mostly
a host problem.
Synchronous memory errors should surface the same as normal.
Asynchronous errors can either use FW first error injection into the VMM
or inject emulated device errors (some support already in QEMU, additional
support under review).
Conclusion for Type 3 emulation
*******************************
Seems doable.
The complexity is in the control paths in the VMM.
No kernel changes needed (I think!)
What I'm looking for from this discussion
=========================================
- Blockers! What problems do people anticipate with each approach?
- General agreement on what we 'might' support in the kernel / QEMU / other VMMs.
- Are there other use cases with similar requirements that we should incorporate?
Appendix : Known corner cases
=============================
These are here mostly for completeness and to track things we need
to solve, rather than because they should greatly influence the
path taken.
CXL Type 3 Performance discovery
--------------------------------
The discussion above suggests that we would represent interleaved CXL devices
as a single device. Given that the NUMA characteristics of CXL attached memory
are calculated partly from PCIe Link register values, which currently indicate
at most a x16, 64GT/s link, presenting several interleaved higher performance
devices as a single device may require representing a device faster than the
hardware specifications allow. If this turns out to be a practical problem,
solutions such as a PCIe DVSEC capability could be used to provide accurate
information. If we can ensure the emulated link is not
acting as a bottleneck, the rest of the performance information from the
topology can be mapped to a combination of emulated host HMAT entries and
emulated CDAT data provided by the emulated type 3 device.
Migration
---------
VM migration will either have to remove all extents, or appropriately
prepopulate them prior to migration. There are possible ways this
may be done with the same memory pool contents via 'temporal' sharing,
but in general this may bring additional complexity.
Kexec etc etc will be similar to how we handle it on the host - probably
just give all the capacity back.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
@ 2024-08-16 7:05 ` Hannes Reinecke
2024-08-16 9:41 ` Jonathan Cameron via
2024-08-19 2:12 ` John Groves
2024-09-19 9:09 ` David Hildenbrand
2 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-08-16 7:05 UTC (permalink / raw)
To: Jonathan Cameron, David Hildenbrand, linux-mm, linux-cxl,
Davidlohr Bueso, Ira Weiny, John Groves, virtualization
Cc: Oscar Salvador, qemu-devel, Dave Jiang, Dan Williams, linuxarm,
wangkefeng.wang, John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On 8/15/24 18:22, Jonathan Cameron wrote:
> Introduction
> ============
>
> If we think application specific memory (including inter-host shared memory) is
> a thing, it will also be a thing people want to use with virtual machines,
> potentially nested. So how do we present it at the Host to VM boundary?
>
> This RFC is perhaps premature given we haven't yet merged upstream support for
> the bare metal case. However I'd like to get the discussion going given we've
> touched briefly on this in a number of CXL sync calls and it is clear no one is
> entirely sure what direction make sense. We may briefly touch on this in the
> LPC CXL uconf, but time will be very limited.
>
Thanks for the detailed write-up.
Can't we have an ad-hoc meeting at OSS/LPC to gather interested/relevant
people to explore ideas around this?
In particular I'd be interested in how to _get_ the application specific
memory to the application in question. It's easy if you have your own
application and design it to work on DAX devices. Obviously this
approach won't work for unmodified applications; however, they really
might want to use this, too.
And, of course, the other mentioned problems are worth discussing, and I
do agree that the uconf will probably not provide sufficient time for
this.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-16 7:05 ` Hannes Reinecke
@ 2024-08-16 9:41 ` Jonathan Cameron via
0 siblings, 0 replies; 14+ messages in thread
From: Jonathan Cameron via @ 2024-08-16 9:41 UTC (permalink / raw)
To: Hannes Reinecke
Cc: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, John Groves, virtualization, Oscar Salvador,
qemu-devel, Dave Jiang, Dan Williams, linuxarm, wangkefeng.wang,
John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On Fri, 16 Aug 2024 09:05:46 +0200
Hannes Reinecke <hare@suse.de> wrote:
> On 8/15/24 18:22, Jonathan Cameron wrote:
> > Introduction
> > ============
> >
> > If we think application specific memory (including inter-host shared memory) is
> > a thing, it will also be a thing people want to use with virtual machines,
> > potentially nested. So how do we present it at the Host to VM boundary?
> >
> > This RFC is perhaps premature given we haven't yet merged upstream support for
> > the bare metal case. However I'd like to get the discussion going given we've
> > touched briefly on this in a number of CXL sync calls and it is clear no one is
> > entirely sure what direction make sense. We may briefly touch on this in the
> > LPC CXL uconf, but time will be very limited.
> >
> Thanks for the detailed write-up.
>
> Can't we have an ad-hoc meeting at OSS/LPC to gather interested/relevant
> people to explore ideas around this?
Absolutely. If people want to email me directly (or mention in the thread)
I'll gather up a list of people to try and find a suitable time / place
(and then post that here).
>
> In particular I'd be interested on how to _get_ the application specific
> memory to the application in question. It's easy if you have your own
> application and design it to work on DAX devices. Obviously this
> approach won't work for unmodified applications; however, they really
> might want to use this, too.
That's a good parallel question (as it's not virtualization specific).
I'd be tempted to enable this path first for aware applications, but longer
term the ability to use this via common allocator libraries (LD_PRELOAD etc)
might make sense (or some other path?)
>
> And, of course, the other mentioned problems are worth discussing, and I
> do agree that the uconf will probably not providing sufficient time for
> this.
>
> Cheers,
>
> Hannes
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
2024-08-16 7:05 ` Hannes Reinecke
@ 2024-08-19 2:12 ` John Groves
2024-08-19 15:40 ` Jonathan Cameron via
2024-09-19 9:09 ` David Hildenbrand
2 siblings, 1 reply; 14+ messages in thread
From: John Groves @ 2024-08-19 2:12 UTC (permalink / raw)
To: Jonathan Cameron
Cc: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, virtualization, Oscar Salvador, qemu-devel, Dave Jiang,
Dan Williams, linuxarm, wangkefeng.wang, John Groves, Fan Ni,
Navneet Singh, “Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On 24/08/15 05:22PM, Jonathan Cameron wrote:
> Introduction
> ============
>
> If we think application specific memory (including inter-host shared memory) is
> a thing, it will also be a thing people want to use with virtual machines,
> potentially nested. So how do we present it at the Host to VM boundary?
>
> This RFC is perhaps premature given we haven't yet merged upstream support for
> the bare metal case. However I'd like to get the discussion going given we've
> touched briefly on this in a number of CXL sync calls and it is clear no one is
Excellent write-up, thanks Jonathan.
Hannes' idea of an in-person discussion at LPC is a great idea - count me in.
As the proprietor of famfs [1] I have many thoughts.
First, I like the concept of application-specific memory (ASM), but I wonder
if there might be a better term for it. ASM suggests that there is one
application, but I'd suggest that a more concise statement of the concept
is that the Linux kernel never accesses or mutates the memory - even though
multiple apps might share it (e.g. via famfs). It's a subtle point, but
an important one for RAS etc. ASM might better be called non-kernel-managed
memory - though that name does not have as good a ring to it. Will mull this
over further...
Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
some of which will be obvious to many of you:
* A DCD is just a memory device with an allocator and host-level
access-control built in.
* Usable memory from a DCD is not available until the fabric manager (likely
on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
command to the DCD.
* A DCD allocation has a tag (uuid) which is the invariant way of identifying
the memory from that allocation.
* The tag becomes known to the host from the DCD extents provided via
a CXL event following successful allocation.
* The memory associated with a tagged allocation will surface as a dax device
on each host that has access to it. But of course dax device naming &
numbering won't be consistent across separate hosts - so we need to use
the uuid's to find specific memory.
A few less foundational observations:
* It does not make sense to "online" shared or sharable memory as system-ram,
because system-ram gets zeroed, which blows up use cases for sharable memory.
So the default for sharable memory must be devdax mode.
* Tags are mandatory for sharable allocations, and allowed but optional for
non-sharable allocations. The implication is that non-sharable allocations
may get onlined automatically as system-ram, so we don't need a namespace
for those. (I argued for mandatory tags on all allocations - hey you don't
have to use them - but encountered objections and dropped it.)
* CXL access control only goes to host root ports; CXL has no concept of
giving access to a VM. So some component on a host (perhaps logically
an orchestrator component) needs to plumb memory to VMs as appropriate.
So tags are a namespace to find specific memory "allocations" (which in the
CXL consortium, we usually refer to as "tagged capacity").
In an orchestrated environment, the orchestrator would allocate resources
(including tagged memory capacity), make that capacity visible on the right
host(s), and then provide the tag when starting the app if needed.
If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
root memory allocation to find the right memory device. Once mounted, it's a
file system so apps can be directed to the mount path. Apps that consume the
dax devices directly also need the uuid because /dev/dax0.0 is not invariant
across a cluster...
I have been assuming that when the CXL stack discovers a new DCD allocation,
it will configure the devdax device and provide some way to find it by tag.
/sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
General thoughts regarding VMs and qemu
Physical connections to CXL memory are handled by physical servers. I don't
think there is a scenario in which a VM should interact directly with the
pcie function(s) of CXL devices. They will be configured as dax devices
(findable by their tags!) by the host OS, and should be provided to VMs
(when appropriate) as DAX devices. And software in a VM needs to be able to
find the right DAX device the same way it would running on bare metal - by
the tag.
Qemu can already get memory from files (-object memory-backend-file,...), and
I believe this works whether it's an actual file or a devdax device. So far,
so good.
Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
not a virtual devdax device. I think virtual devdax is needed as a first-class
abstraction. If we can add the tag as a property of the memory-backend-file,
we're almost there - we just need a way to look up a daxdev by tag.
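For reference, that existing virtual pmem path looks something like this today
(standard QEMU nvdimm usage; the tag property and a first-class virtual devdax
device are the missing pieces):

  qemu-system-x86_64 -M q35,nvdimm=on -m 4G,slots=2,maxmem=8G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=2G,align=2M \
    -device nvdimm,memdev=mem1,id=nvdimm1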
Summary thoughts:
* A mechanism for resolving tags to "tagged capacity" devdax devices is
essential (and I don't think there are specific proposals about this
mechanism so far).
* Said mechanism should not be explicitly CXL-specific.
* Finding a tagged capacity devdax device in a VM should work the same as it
does running on bare metal.
* The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
* Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
points for being easy to implement in both physical and virtual systems.
Thanks for teeing this up!
John
[1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-19 2:12 ` John Groves
@ 2024-08-19 15:40 ` Jonathan Cameron via
2024-09-17 19:37 ` Jonathan Cameron via
2024-09-17 19:56 ` Jonathan Cameron via
0 siblings, 2 replies; 14+ messages in thread
From: Jonathan Cameron via @ 2024-08-19 15:40 UTC (permalink / raw)
To: John Groves
Cc: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, virtualization, Oscar Salvador, qemu-devel, Dave Jiang,
Dan Williams, linuxarm, wangkefeng.wang, John Groves, Fan Ni,
Navneet Singh, “Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On Sun, 18 Aug 2024 21:12:34 -0500
John Groves <John@groves.net> wrote:
> On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > Introduction
> > ============
> >
> > If we think application specific memory (including inter-host shared memory) is
> > a thing, it will also be a thing people want to use with virtual machines,
> > potentially nested. So how do we present it at the Host to VM boundary?
> >
> > This RFC is perhaps premature given we haven't yet merged upstream support for
> > the bare metal case. However I'd like to get the discussion going given we've
> > touched briefly on this in a number of CXL sync calls and it is clear no one is
>
> Excellent write-up, thanks Jonathan.
>
> Hannes' idea of an in-person discussion at LPC is a great idea - count me in.
Had a feeling you might say that ;)
>
> As the proprietor of famfs [1] I have many thoughts.
>
> First, I like the concept of application-specific memory (ASM), but I wonder
> if there might be a better term for it. ASM suggests that there is one
> application, but I'd suggest that a more concise statement of the concept
> is that the Linux kernel never accesses or mutates the memory - even though
> multiple apps might share it (e.g. via famfs). It's a subtle point, but
> an important one for RAS etc. ASM might better be called non-kernel-managed
> memory - though that name does not have as good a ring to it. Will mull this
> over further...
Naming is always the hard bit :) I agree that one doesn't work for
shared capacity. You can tell I didn't start there :)
>
> Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> some of which will be obvious to many of you:
>
> * A DCD is just a memory device with an allocator and host-level
> access-control built in.
> * Usable memory from a DCD is not available until the fabric manger (likely
> on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> command to the DCD.
> * A DCD allocation has a tag (uuid) which is the invariant way of identifying
> the memory from that allocation.
> * The tag becomes known to the host from the DCD extents provided via
> a CXL event following succesful allocation.
> * The memory associated with a tagged allocation will surface as a dax device
> on each host that has access to it. But of course dax device naming &
> numbering won't be consistent across separate hosts - so we need to use
> the uuid's to find specific memory.
>
> A few less foundational observations:
>
> * It does not make sense to "online" shared or sharable memory as system-ram,
> because system-ram gets zeroed, which blows up use cases for sharable memory.
> So the default for sharable memory must be devdax mode.
(CXL specific diversion)
Absolutely agree with this. There is a 'corner' that irritates me in the spec though,
which is that there is no distinction between shareable and shared capacity.
If we are in a constrained setup with limited HPA or DPA space, we may not want
to have separate DCD regions for these. Thus it is plausible that an orchestrator
might tell a memory appliance to present memory for general use and yet it
surfaces as shareable. So there may need to be an opt in path at least for
going ahead and using this memory as normal RAM.
> * Tags are mandatory for sharable allocations, and allowed but optional for
> non-sharable allocations. The implication is that non-sharable allocations
> may get onlined automatically as system-ram, so we don't need a namespace
> for those. (I argued for mandatory tags on all allocations - hey you don't
> have to use them - but encountered objections and dropped it.)
> * CXL access control only goes to host root ports; CXL has no concept of
> giving access to a VM. So some component on a host (perhaps logically
> an orchestrator component) needs to plumb memory to VMs as appropriate.
Yes. It's some mashup of an orchestrator and VMM / libvirt, local library
of your choice. We can just group these into the ill-defined concept of
a distributed orchestrator.
>
> So tags are a namespace to find specific memory "allocations" (which in the
> CXL consortium, we usually refer to as "tagged capacity").
>
> In an orchestrated environment, the orchestrator would allocate resources
> (including tagged memory capacity), make that capacity visible on the right
> host(s), and then provide the tag when starting the app if needed.
>
> if (e.g.) the memory cotains a famfs file system, famfs needs the uuid of the
> root memory allocation to find the right memory device. Once mounted, it's a
> file sytem so apps can be directed to the mount path. Apps that consume the
> dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> across a cluster...
>
> I have been assuming that when the CXL stack discovers a new DCD allocation,
> it will configure the devdax device and provide some way to find it by tag.
> /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
Agreed. Whether that's a nice kernel side thing, or a utility pulling data
from various kernel subsystem interfaces doesn't really matter. I'd prefer
the kernel presents this but maybe that won't work for some reason.
>
> General thoughts regarding VMs and qemu
>
> Physical connections to CXL memory are handled by physical servers. I don't
> think there is a scenario in which a VM should interact directly with the
> pcie function(s) of CXL devices. They will be configured as dax devices
> (findable by their tags!) by the host OS, and should be provided to VMs
> (when appropriate) as DAX devices. And software in a VM needs to be able to
> find the right DAX device the same way it would running on bare metal - by
> the tag.
Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
types are a can of worms for another day.
>
> Qemu can already get memory from files (-object memory-backend-file,...), and
> I believe this works whether it's an actual file or a devdax device. So far,
> so good.
>
> Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> not a virtual devdax device. I think virtual devdax is needed as a first-class
> abstraction. If we can add the tag as a property of the memory-backend-file,
> we're almost there - we just need away to lookup a daxdev by tag.
I'm not sure that is simple. We'd need to define a new interface capable of:
1) Hotplug - potentially of many separate regions (think nested VMs).
That more or less rules out using separate devices on a discoverable hotpluggable
bus. We'd run out of bus numbers too quickly if putting them on PCI.
ACPI style hotplug is worse because we have to provision slots at the outset.
2) Runtime provision of metadata - performance data very least (bandwidth /
latency etc). In theory could wire up ACPI _HMA but no one has ever bothered.
3) Probably do want async error signaling. We 'could' do that with
FW first error injection - I'm not sure it's a good idea but it's definitely
an option.
A locked down CXL device is a bit more than that, but not very much more.
It's easy to fake registers for things that are always in one state so
that the software stack is happy.
virtio-mem has some of the parts and could perhaps be augmented
to support this use case with the advantage of no implicit tie to CXL.
>
> Summary thoughts:
>
> * A mechanism for resolving tags to "tagged capacity" devdax devices is
> essential (and I don't think there are specific proposals about this
> mechanism so far).
Agreed.
> * Said mechanism should not be explicitly CXL-specific.
Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
memory for example to a VM. It will trivially work if that is what a user
wants to do and also illustrates that this stuff doesn't necessarily just
apply to capacity on a memory pool - it might just be 'weird' memory on the host.
> * Finding a tagged capacity devdax device in a VM should work the same as it
> does running on bare metal.
Absolutely - that's a requirement.
> * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
Maybe. I'm not convinced the abstraction is needed at that particular level.
> * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> points for being easy to implement in both physical and virtual systems.
For physical systems we aren't going to get agreement :( For the systems
I have visibility of there will be some diversity in hardware, but consistency
in the presentation to userspace and above should be doable.
Jonathan
>
> Thanks for teeing this up!
> John
>
>
> [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-19 15:40 ` Jonathan Cameron via
@ 2024-09-17 19:37 ` Jonathan Cameron via
2024-10-22 14:11 ` Gregory Price
2024-09-17 19:56 ` Jonathan Cameron via
1 sibling, 1 reply; 14+ messages in thread
From: Jonathan Cameron via @ 2024-09-17 19:37 UTC (permalink / raw)
To: John Groves
Cc: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, virtualization, Oscar Salvador, qemu-devel, Dave Jiang,
Dan Williams, Linuxarm, Wangkefeng (OS Kernel Lab), John Groves,
Fan Ni, Navneet Singh, “Michael S. Tsirkin”,
Igor Mammedov, Philippe Mathieu-Daudé
Plan is currently to meet at lpc registration desk 2pm tomorrow Wednesday and we will find a room.
J
________________________________
Jonathan Cameron
Mobile: +44-7870588074
Mail: jonathan.cameron@huawei.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-09-17 19:37 ` Jonathan Cameron via
@ 2024-10-22 14:11 ` Gregory Price
0 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2024-10-22 14:11 UTC (permalink / raw)
To: Jonathan Cameron
Cc: John Groves, David Hildenbrand, linux-mm, linux-cxl,
Davidlohr Bueso, Ira Weiny, virtualization, Oscar Salvador,
qemu-devel, Dave Jiang, Dan Williams, Linuxarm,
Wangkefeng (OS Kernel Lab), John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On Tue, Sep 17, 2024 at 07:37:21PM +0000, Jonathan Cameron wrote:
> > * Said mechanism should not be explicitly CXL-specific.
>
> Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
> memory for example to a VM. It will trivially work if that is what a user
> wants to do and also illustrates that this stuff doesn't necessarily just
> apply to capacity on a memory pool - it might just be 'weird' memory on the host.
>
I suspect if you took all the DCD components of the current CXL device
and repackaged them into a device called "DefinitelyNotACXLDCDDevice" that
the CXL device then inherited from, this whole discussion would go away.
Patches welcome? :]
> > * Finding a tagged capacity devdax device in a VM should work the same as it
> > does running on bare metal.
>
> Absolutely - that's a requirement.
>
> > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
>
> Maybe. I'm not convinced the abstraction is needed at that particular level.
>
> > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > points for being easy to implement in both physical and virtual systems.
>
> For physical systems we aren't going to get agreement :( For the systems
> I have visibility of there will be some diversity in hardware, but the
> presentation to userspace and up consistency should be doable.
>
> Jonathan
>
> >
> > Thanks for teeing this up!
> > John
> >
> >
> > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
> >
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-19 15:40 ` Jonathan Cameron via
2024-09-17 19:37 ` Jonathan Cameron via
@ 2024-09-17 19:56 ` Jonathan Cameron via
2024-09-18 12:12 ` Jonathan Cameron
1 sibling, 1 reply; 14+ messages in thread
From: Jonathan Cameron via @ 2024-09-17 19:56 UTC (permalink / raw)
To: John Groves, linuxarm
Cc: David Hildenbrand, linux-mm, linux-cxl, Davidlohr Bueso,
Ira Weiny, virtualization, Oscar Salvador, qemu-devel, Dave Jiang,
Dan Williams, Wangkefeng (OS Kernel Lab), John Groves, Fan Ni,
Navneet Singh, “Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On Tue, 17 Sep 2024 19:37:21 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> Plan is currently to meet at lpc registration desk 2pm tomorrow Wednesday and we will find a room.
>
And now the internet maybe knows my phone number (serves me right for using
my company mobile app that auto-added a signature).
I might have been lucky and it didn't hit the archives because
the formatting was too broken.
Anyhow, see some of you tomorrow. I didn't manage to borrow a jabra mic
so remote will be tricky but feel free to reach out and we might be
able to sort something.
Intent is this will be an informal BoF, so we'll figure out the scope
at the start of the meeting.
Sorry for the noise!
Jonathan
> J
> On Sun, 18 Aug 2024 21:12:34 -0500
> John Groves <John@groves.net> wrote:
>
> > On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > > Introduction
> > > ============
> > >
> > > If we think application specific memory (including inter-host shared memory) is
> > > a thing, it will also be a thing people want to use with virtual machines,
> > > potentially nested. So how do we present it at the Host to VM boundary?
> > >
> > > This RFC is perhaps premature given we haven't yet merged upstream support for
> > > the bare metal case. However I'd like to get the discussion going given we've
> > > touched briefly on this in a number of CXL sync calls and it is clear no one is
> >
> > Excellent write-up, thanks Jonathan.
> >
> > Hannes' idea of an in-person discussion at LPC is a great idea - count me in.
>
> Had a feeling you might say that ;)
>
> >
> > As the proprietor of famfs [1] I have many thoughts.
> >
> > First, I like the concept of application-specific memory (ASM), but I wonder
> > if there might be a better term for it. ASM suggests that there is one
> > application, but I'd suggest that a more concise statement of the concept
> > is that the Linux kernel never accesses or mutates the memory - even though
> > multiple apps might share it (e.g. via famfs). It's a subtle point, but
> > an important one for RAS etc. ASM might better be called non-kernel-managed
> > memory - though that name does not have as good a ring to it. Will mull this
> > over further...
>
> Naming is always the hard bit :) I agree that one doesn't work for
> shared capacity. You can tell I didn't start there :)
>
> >
> > Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> > some of which will be obvious to many of you:
> >
> > * A DCD is just a memory device with an allocator and host-level
> > access-control built in.
> > * Usable memory from a DCD is not available until the fabric manger (likely
> > on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> > command to the DCD.
> > * A DCD allocation has a tag (uuid) which is the invariant way of identifying
> > the memory from that allocation.
> > * The tag becomes known to the host from the DCD extents provided via
> > a CXL event following succesful allocation.
> > * The memory associated with a tagged allocation will surface as a dax device
> > on each host that has access to it. But of course dax device naming &
> > numbering won't be consistent across separate hosts - so we need to use
> > the uuid's to find specific memory.
> >
> > A few less foundational observations:
> >
> > * It does not make sense to "online" shared or sharable memory as system-ram,
> > because system-ram gets zeroed, which blows up use cases for sharable memory.
> > So the default for sharable memory must be devdax mode.
> (CXL specific diversion)
>
> Absolutely agree with this. There is a 'corner' that irritates me in the spec though
> which is that there is no distinction between shareable and shared capacity.
> If we are in a constrained setup with limited HPA or DPA space, we may not want
> to have separate DCD regions for these. Thus it is plausible that an orchestrator
> might tell a memory appliance to present memory for general use and yet it
> surfaces as shareable. So there may need to be an opt in path at least for
> going ahead and using this memory as normal RAM.
>
> > * Tags are mandatory for sharable allocations, and allowed but optional for
> > non-sharable allocations. The implication is that non-sharable allocations
> > may get onlined automatically as system-ram, so we don't need a namespace
> > for those. (I argued for mandatory tags on all allocations - hey you don't
> > have to use them - but encountered objections and dropped it.)
> > * CXL access control only goes to host root ports; CXL has no concept of
> > giving access to a VM. So some component on a host (perhaps logically
> > an orchestrator component) needs to plumb memory to VMs as appropriate.
>
> Yes. It's some mashup of an orchestrator and VMM / libvirt, local library
> of your choice. We can just group it into the ill-defined concept of
> a distributed orchestrator.
>
> >
> > So tags are a namespace to find specific memory "allocations" (which in the
> > CXL consortium, we usually refer to as "tagged capacity").
> >
> > In an orchestrated environment, the orchestrator would allocate resources
> > (including tagged memory capacity), make that capacity visible on the right
> > host(s), and then provide the tag when starting the app if needed.
> >
> > If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
> > root memory allocation to find the right memory device. Once mounted, it's a
> > file system so apps can be directed to the mount path. Apps that consume the
> > dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> > across a cluster...
> >
> > I have been assuming that when the CXL stack discovers a new DCD allocation,
> > it will configure the devdax device and provide some way to find it by tag.
> > /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> > around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
>
> Agreed. Whether that's a nice kernel side thing, or a utility pulling data
> from various kernel subsystem interfaces doesn't really matter. I'd prefer
> the kernel presents this but maybe that won't work for some reason.
>
> >
> > General thoughts regarding VMs and qemu
> >
> > Physical connections to CXL memory are handled by physical servers. I don't
> > think there is a scenario in which a VM should interact directly with the
> > pcie function(s) of CXL devices. They will be configured as dax devices
> > (findable by their tags!) by the host OS, and should be provided to VMs
> > (when appropriate) as DAX devices. And software in a VM needs to be able to
> > find the right DAX device the same way it would running on bare metal - by
> > the tag.
>
> Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
> types are a can of worms for another day.
>
> >
> > Qemu can already get memory from files (-object memory-backend-file,...), and
> > I believe this works whether it's an actual file or a devdax device. So far,
> > so good.
> >
> > Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> > not a virtual devdax device. I think virtual devdax is needed as a first-class
> > abstraction. If we can add the tag as a property of the memory-backend-file,
> > we're almost there - we just need a way to look up a daxdev by tag.
>
> I'm not sure that is simple. We'd need to define a new interface capable of:
> 1) Hotplug - potentially of many separate regions (think nested VMs).
> That more or less rules out using separate devices on a discoverable hotpluggable
> bus. We'd run out of bus numbers too quickly if putting them on PCI.
> ACPI style hotplug is worse because we have to provision slots at the outset.
> 2) Runtime provision of metadata - performance data at the very least (bandwidth /
> latency etc). In theory could wire up ACPI _HMA but no one has ever bothered.
> 3) Probably do want async error signaling. We 'could' do that with
> FW first error injection - I'm not sure it's a good idea but it's definitely
> an option.
>
> A locked down CXL device is a bit more than that, but not very much more.
> It's easy to fake registers for things that are always in one state so
> that the software stack is happy.
>
> virtio-mem has some of the parts and could perhaps be augmented
> to support this use case with the advantage of no implicit tie to CXL.
>
>
> >
> > Summary thoughts:
> >
> > * A mechanism for resolving tags to "tagged capacity" devdax devices is
> > essential (and I don't think there are specific proposals about this
> > mechanism so far).
>
> Agreed.
>
> > * Said mechanism should not be explicitly CXL-specific.
>
> Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
> memory for example to a VM. It will trivially work if that is what a user
> wants to do and also illustrates that this stuff doesn't necessarily just
> apply to capacity on a memory pool - it might just be 'weird' memory on the host.
>
> > * Finding a tagged capacity devdax device in a VM should work the same as it
> > does running on bare metal.
>
> Absolutely - that's a requirement.
>
> > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
>
> Maybe. I'm not convinced the abstraction is needed at that particular level.
>
> > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > points for being easy to implement in both physical and virtual systems.
>
> For physical systems we aren't going to get agreement :( For the systems
> I have visibility of, there will be some diversity in hardware, but
> consistency in the presentation to userspace and above should be doable.
>
> Jonathan
>
> >
> > Thanks for teeing this up!
> > John
> >
> >
> > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
> >
>
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-09-17 19:56 ` Jonathan Cameron via
@ 2024-09-18 12:12 ` Jonathan Cameron
0 siblings, 0 replies; 14+ messages in thread
From: Jonathan Cameron @ 2024-09-18 12:12 UTC (permalink / raw)
To: Jonathan Cameron
Cc: John Groves, linuxarm, David Hildenbrand, linux-mm, linux-cxl,
Davidlohr Bueso, Ira Weiny, virtualization, Oscar Salvador,
qemu-devel, Dave Jiang, Dan Williams, Wangkefeng (OS Kernel Lab),
John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
On Tue, 17 Sep 2024 20:56:53 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> On Tue, 17 Sep 2024 19:37:21 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> > Plan is currently to meet at lpc registration desk 2pm tomorrow Wednesday and we will find a room.
> >
>
> And now the internet maybe knows my phone number (serves me right for using
> my company mobile app that auto added a signature)
> I might have been lucky and it didn't hit the archives because
> the formatting was too broken..
>
> Anyhow, see some of you tomorrow. I didn't manage to borrow a jabra mic
> so remote will be tricky but feel free to reach out and we might be
> able to sort something.
>
> Intent is this will be an informal BoF so we'll figure out the scope
> at the start of the meeting.
>
> Sorry for the noise!
Hack room 1.14 now if anyone is looking for us.
>
> Jonathan
>
> > J
> > On Sun, 18 Aug 2024 21:12:34 -0500
> > John Groves <John@groves.net> wrote:
> >
> > > On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > > > Introduction
> > > > ============
> > > >
> > > > If we think application specific memory (including inter-host shared memory) is
> > > > a thing, it will also be a thing people want to use with virtual machines,
> > > > potentially nested. So how do we present it at the Host to VM boundary?
> > > >
> > > > This RFC is perhaps premature given we haven't yet merged upstream support for
> > > > the bare metal case. However I'd like to get the discussion going given we've
> > > > touched briefly on this in a number of CXL sync calls and it is clear no one is
> > >
> > > Excellent write-up, thanks Jonathan.
> > >
> > > Hannes' idea of an in-person discussion at LPC is a great idea - count me in.
> >
> > Had a feeling you might say that ;)
> >
> > >
> > > As the proprietor of famfs [1] I have many thoughts.
> > >
> > > First, I like the concept of application-specific memory (ASM), but I wonder
> > > if there might be a better term for it. ASM suggests that there is one
> > > application, but I'd suggest that a more concise statement of the concept
> > > is that the Linux kernel never accesses or mutates the memory - even though
> > > multiple apps might share it (e.g. via famfs). It's a subtle point, but
> > > an important one for RAS etc. ASM might better be called non-kernel-managed
> > > memory - though that name does not have as good a ring to it. Will mull this
> > > over further...
> >
> > Naming is always the hard bit :) I agree that one doesn't work for
> > shared capacity. You can tell I didn't start there :)
> >
> > >
> > > Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> > > some of which will be obvious to many of you:
> > >
> > > * A DCD is just a memory device with an allocator and host-level
> > > access-control built in.
> > > * Usable memory from a DCD is not available until the fabric manager (likely
> > > on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> > > command to the DCD.
> > > * A DCD allocation has a tag (uuid) which is the invariant way of identifying
> > > the memory from that allocation.
> > > * The tag becomes known to the host from the DCD extents provided via
> > > a CXL event following successful allocation.
> > > * The memory associated with a tagged allocation will surface as a dax device
> > > on each host that has access to it. But of course dax device naming &
> > > numbering won't be consistent across separate hosts - so we need to use
> > > the uuids to find specific memory.
> > >
> > > A few less foundational observations:
> > >
> > > * It does not make sense to "online" shared or sharable memory as system-ram,
> > > because system-ram gets zeroed, which blows up use cases for sharable memory.
> > > So the default for sharable memory must be devdax mode.
> > (CXL specific diversion)
> >
> > Absolutely agree with this. There is a 'corner' that irritates me in the spec though
> > which is that there is no distinction between shareable and shared capacity.
> > If we are in a constrained setup with limited HPA or DPA space, we may not want
> > to have separate DCD regions for these. Thus it is plausible that an orchestrator
> > might tell a memory appliance to present memory for general use and yet it
> > surfaces as shareable. So there may need to be an opt in path at least for
> > going ahead and using this memory as normal RAM.
> >
> > > * Tags are mandatory for sharable allocations, and allowed but optional for
> > > non-sharable allocations. The implication is that non-sharable allocations
> > > may get onlined automatically as system-ram, so we don't need a namespace
> > > for those. (I argued for mandatory tags on all allocations - hey you don't
> > > have to use them - but encountered objections and dropped it.)
> > > * CXL access control only goes to host root ports; CXL has no concept of
> > > giving access to a VM. So some component on a host (perhaps logically
> > > an orchestrator component) needs to plumb memory to VMs as appropriate.
> >
> > Yes. It's some mashup of an orchestrator and VMM / libvirt, local library
> > of your choice. We can just group it into the ill-defined concept of
> > a distributed orchestrator.
> >
> > >
> > > So tags are a namespace to find specific memory "allocations" (which in the
> > > CXL consortium, we usually refer to as "tagged capacity").
> > >
> > > In an orchestrated environment, the orchestrator would allocate resources
> > > (including tagged memory capacity), make that capacity visible on the right
> > > host(s), and then provide the tag when starting the app if needed.
> > >
> > > If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
> > > root memory allocation to find the right memory device. Once mounted, it's a
> > > file system so apps can be directed to the mount path. Apps that consume the
> > > dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> > > across a cluster...
> > >
> > > I have been assuming that when the CXL stack discovers a new DCD allocation,
> > > it will configure the devdax device and provide some way to find it by tag.
> > > /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> > > around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
> >
> > Agreed. Whether that's a nice kernel side thing, or a utility pulling data
> > from various kernel subsystem interfaces doesn't really matter. I'd prefer
> > the kernel presents this but maybe that won't work for some reason.
> >
> > >
> > > General thoughts regarding VMs and qemu
> > >
> > > Physical connections to CXL memory are handled by physical servers. I don't
> > > think there is a scenario in which a VM should interact directly with the
> > > pcie function(s) of CXL devices. They will be configured as dax devices
> > > (findable by their tags!) by the host OS, and should be provided to VMs
> > > (when appropriate) as DAX devices. And software in a VM needs to be able to
> > > find the right DAX device the same way it would running on bare metal - by
> > > the tag.
> >
> > Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
> > types are a can of worms for another day.
> >
> > >
> > > Qemu can already get memory from files (-object memory-backend-file,...), and
> > > I believe this works whether it's an actual file or a devdax device. So far,
> > > so good.
> > >
> > > Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> > > not a virtual devdax device. I think virtual devdax is needed as a first-class
> > > abstraction. If we can add the tag as a property of the memory-backend-file,
> > > we're almost there - we just need a way to look up a daxdev by tag.
> >
> > I'm not sure that is simple. We'd need to define a new interface capable of:
> > 1) Hotplug - potentially of many separate regions (think nested VMs).
> > That more or less rules out using separate devices on a discoverable hotpluggable
> > bus. We'd run out of bus numbers too quickly if putting them on PCI.
> > ACPI style hotplug is worse because we have to provision slots at the outset.
> > 2) Runtime provision of metadata - performance data at the very least (bandwidth /
> > latency etc). In theory could wire up ACPI _HMA but no one has ever bothered.
> > 3) Probably do want async error signaling. We 'could' do that with
> > FW first error injection - I'm not sure it's a good idea but it's definitely
> > an option.
> >
> > A locked down CXL device is a bit more than that, but not very much more.
> > It's easy to fake registers for things that are always in one state so
> > that the software stack is happy.
> >
> > virtio-mem has some of the parts and could perhaps be augmented
> > to support this use case with the advantage of no implicit tie to CXL.
> >
> >
> > >
> > > Summary thoughts:
> > >
> > > * A mechanism for resolving tags to "tagged capacity" devdax devices is
> > > essential (and I don't think there are specific proposals about this
> > > mechanism so far).
> >
> > Agreed.
> >
> > > * Said mechanism should not be explicitly CXL-specific.
> >
> > Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> > ties to CXL. I'm not against using CXL to present HBM / ACPI Specific Purpose
> > memory for example to a VM. It will trivially work if that is what a user
> > wants to do and also illustrates that this stuff doesn't necessarily just
> > apply to capacity on a memory pool - it might just be 'weird' memory on the host.
> >
> > > * Finding a tagged capacity devdax device in a VM should work the same as it
> > > does running on bare metal.
> >
> > Absolutely - that's a requirement.
> >
> > > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
> >
> > Maybe. I'm not convinced the abstraction is needed at that particular level.
> >
> > > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > > points for being easy to implement in both physical and virtual systems.
> >
> > For physical systems we aren't going to get agreement :( For the systems
> > I have visibility of, there will be some diversity in hardware, but
> > consistency in the presentation to userspace and above should be doable.
> >
> > Jonathan
> >
> > >
> > > Thanks for teeing this up!
> > > John
> > >
> > >
> > > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
> > >
> >
> >
> >
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
2024-08-16 7:05 ` Hannes Reinecke
2024-08-19 2:12 ` John Groves
@ 2024-09-19 9:09 ` David Hildenbrand
2024-09-20 9:06 ` Gregory Price
2 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2024-09-19 9:09 UTC (permalink / raw)
To: Jonathan Cameron, linux-mm, linux-cxl, Davidlohr Bueso, Ira Weiny,
John Groves, virtualization
Cc: Oscar Salvador, qemu-devel, Dave Jiang, Dan Williams, linuxarm,
wangkefeng.wang, John Groves, Fan Ni, Navneet Singh,
“Michael S. Tsirkin”, Igor Mammedov,
Philippe Mathieu-Daudé
Sorry for the late reply ...
> Later on, some magic entity - let's call it an orchestrator, will tell the
In the virtio-mem world, that's usually something (admin/tool/whatever)
in the hypervisor. What does it look like with CXL on bare metal?
> memory pool to provide memory to a given host. The host gets notified by
> the device of an 'offer' of specific memory extents and can accept it, after
> which it may start to make use of the provided memory extent.
>
> Those address ranges may be shared across multiple hosts (in which case they
> are not for general use), or may be dedicated memory intended for use as
> normal RAM.
>
> Whilst the granularity of DCD extents is allowed by the specification to be very
> fine (64 Bytes), in reality my expectation is no one will build general purpose
> memory pool devices with fine granularity.
>
> Memory hot-plug options (bare metal)
> ------------------------------------
>
> By default, these extents will surface as either:
> 1) Normal memory hot-plugged into a NUMA node.
> 2) DAX - requiring applications to map that memory directly or use
> a filesystem etc.
>
> There are various ways to apply policy to this. One is to base the policy
> decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
> is metadata that originates at the orchestrator. It's big enough to hold a
> UUID, so can convey whatever meaning is agreed by the orchestrator and the
> software running on each host.
>
> Memory pools tend to want to guarantee, when the circumstances change
> (workload finishes etc), they can have the resources they allocated back.
Of course they want that guarantee. *insert usual unicorn example*
We can usually try hard, but "guarantee" is really a strong requirement
that I am afraid we won't be able to give in many scenarios.
I'm sure CXL people were aware this is one of the basic issues of memory
hotunplug (at least I kept telling them). If not, they didn't do their
research properly or tried to ignore it.
> CXL brings polite ways of asking for the memory back and big hammers for
> when the host ignores things (which may well crash a naughty host).
> Reliable hot unplug of normal memory continues to be a challenge for memory
> that is 'normal' because not all its use / lifetime is tied to a particular
> application.
Yes. And crashing is worse than anything else. Rather shutdown/reboot
the offending machine in a somewhat nice way instead of crashing it.
>
> Application specific memory
> ---------------------------
>
> The DAX path enables association of the memory with a single application
> by allowing that application to simply mmap appropriate /dev/daxX.Y
> That device optionally has an associated tag.
>
> When the application closes or otherwise releases that memory we can
> guarantee to be able to recover the capacity. Memory provided to an
> application this way will be referred to here as Application Specific Memory.
> This model also works for HBM or other 'better' memory that is reserved for
> specific use cases.
>
> So the flow is something like:
> 1. Cloud orchestrator decides it's going to run in memory database A
> on host W.
> 2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
> UUID / tag wwwwxxxxzzzz
> 3. Host W accepts that memory (why would it say no?) and creates a
> DAX device for which the tag is discoverable.
Maybe there could be limitations (maximum addressable PFN?) where we
would have to reject it? Not sure.
> 4. Orchestrator tells host W to launch the workload and that it
> should use the memory provided with tag wwwwxxxxzzzz.
> 5. Host launches DB and tells it to use DAX device with tag wwwwxxxxzzzz
> which the DB then mmap()s and loads its database data into.
> ... sometime later....
> 6. Orchestrator tells host W to close that DB and release the memory
> allocated from the pool.
> 7. Host gives the memory back to the memory appliance which can then use
> it to provide another host with the necessary memory.
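Concretely, the host side of steps 3-5 might end up looking something like
the below - the tag attribute here is purely illustrative, since no such
interface is merged yet:
  # New DCD extent accepted; a devdax device appears, say dax2.0.
  # Resolve tag -> dax device via a hypothetical sysfs attribute:
  grep -l wwwwxxxxzzzz /sys/bus/dax/devices/dax*/tag
  #   /sys/bus/dax/devices/dax2.0/tag
  # Then point the workload at /dev/dax2.0, which it mmap()s directly.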
>
> This approach requires applications or at least memory allocation libraries to
> be modified. The guarantees of getting the memory they asked for + that they
> will definitely be able to safely give the memory back when done, may make such
> software modifications worthwhile.
>
> There are disadvantages and bloat issues if the 'wrong' amount of memory is
> allocated to the application. So these techniques only work when the
> orchestrator has the necessary information about the workload.
Yes.
>
> Note that one specific example of application specific memory is virtual
> machines. Just in this case the virtual machine is the application.
> Later on it may be useful to consider the example of the specific
> application in a VM being a nested virtual machine.
>
> Shared Memory - closely related!
> --------------------------------
>
> CXL enables a number of different types of memory sharing across multiple
> hosts:
> - Read only shared memory (suitable for apache arrow for example)
> - Hardware Coherent shared memory.
> - Software managed coherency.
Do we have any timeline for when we will see real shared-memory devices? zVM
has supported shared segments between VMs for a couple of decades.
>
> These surface using the same machinery as non shared DCD extents. Note however
> that the presentation, in terms of extents, to different hosts is not the same
> (can be different extents, in an unrelated order) but along with tags, shared
> extents have sufficient data to 'construct' a virtual address to HPA mapping
> that makes them look the same to aware application or file systems. Current
> proposed approach to this is to surface the extents via DAX and apply a
> filesystem approach to managing the data.
> https://lpc.events/event/18/contributions/1827/
>
> These two types of memory pooling activity (shared memory, application specific
> memory) both require capacity associated with a tag to be presented to specific
> users in a fashion that is 'separate' from normal memory hot-plug.
>
> The virtualization question
> ===========================
>
> Having made the assumption that the models above are going to be used in
> practice, and that Linux will support them, the natural next step is to
> assume that applications designed against them are going to be used in virtual
> machines as well as on bare metal hosts.
>
> The open question this RFC is aiming to start discussion around is how best to
> present them to the VM. I want to get that discussion going early because
> some of the options I can see will require specification additions and / or
> significant PoC / development work to prove them out. Before we go there,
> let us briefly consider other uses of pooled memory in VMs and how they
> aren't really relevant here.
>
> Other virtualization uses of memory pool capacity
> -------------------------------------------------
>
> 1. Part of static capacity of VM provided from a memory pool.
> Can be presented as a NUMA setup, with HMAT etc providing performance data
> relative to other memory the VM is using. Recovery of pooled capacity
> requires shutting down or migrating the VM.
> 2. Coarse grained memory increases for 'normal' memory.
> Can use memory hot-plug. Recovery of capacity likely to only be possible on
> VM shutdown.
Is there a reason "movable" (ZONE_MOVABLE) is not an option, at least
in some setups? If not, why?
>
> Both these use cases are well covered by existing solutions so we can ignore
> them for the rest of this document.
>
> Application specific or shared dynamic capacity - VM options.
> -------------------------------------------------------------
>
> 1. Memory hot-plug - but with specific purpose memory flag set in EFI
> memory map. Current default policy is to bring those up as normal memory.
> That policy can be adjusted via kernel option or Kconfig so they turn up
> as DAX. We 'could' augment the metadata of such hot-plugged memory
> with the UID / tag from an underlying bare metal DAX device.
>
> 2. Virtio-mem - It may be possible to fit this use case within an extended
> virtio-mem.
>
> 3. Emulate a CXL type 3 device.
>
> 4. Other options?
>
> Memory hotplug
> --------------
>
> This is the heavy weight solution but should 'work' if we close a specification
> gap. Granularity limitations are unlikely to be a big problem given anticipated
> CXL devices.
>
> Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific Purpose
> Memory" intended to notify the operating system that it can use the memory as
> normal, but it is there for a specific use case and so might be wanted back at
> any point. This memory attribute can be provided in the memory map at boot
> time and if associated with EfiReservedMemoryType can be used to indicate a
> range of HPA Space where memory that is hot-plugged later should be treated as
> 'special'.
>
> There isn't an obvious path to associate a particular range of hot plugged
> memory with a UID / tag. I'd expect we'd need to add something to the ACPI
> specification to enable this.
>
> Virtio-mem
> ----------
>
> The design goals of virtio-mem [1] mean that it is not 'directly' applicable
> to this case, but could perhaps be adapted with the addition of metadata
> and DAX + guaranteed removal of explicit extents.
Maybe it could be extended, or one could build something similar
that is better tailored to the "shared memory" use case.
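Today's virtio-mem plumbing gives a rough idea of the starting point
(syntax from memory, so treat it as a sketch); what is missing for this
use case is per-extent tag metadata and a guest-side DAX rather than
system-RAM presentation:
  qemu-system-x86_64 -m 4G,maxmem=260G \
      -object memory-backend-file,id=vmem0,share=on,mem-path=/dev/dax2.0,size=256G \
      -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=0   # ... rest of the VM config
  # resize at runtime from the monitor:
  #   (qemu) qom-set vm0 requested-size 128G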
>
> [1] [virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
> Martin Schulz, VEE '21]
>
> Emulating a CXL Type 3 Device
> -----------------------------
>
> Concerns raised about just emulating a CXL topology:
> * A CXL Type 3 device is pretty complex.
> * All we need is a tag + make it DAX, so surely this is too much?
>
> Possible advantages
> * Kernel is exactly the same as that running on the host. No new drivers or
> changes to existing drivers needed as what we are presenting is a possible
> device topology - which may be much simpler than the host.
>
> Complexity:
> ***********
>
> We don't emulate everything that can exist in physical topologies.
> - One emulated device per host CXL Fixed Memory Window
> (I think we can't quite get away with just one in total due to BW/Latency
> discovery)
> - Direct connect each emulated device to an emulated RP + Host Bridge.
> - Single CXL Fixed Memory Window. Never present interleave (that's a host
> only problem).
> - Can probably always present a single extent per DAX region (if we don't
> mind burning some GPA space to avoid fragmentation).
For "ordinary" hotplug virtio-mem provides real benefits over DIMMs. One
thing to consider might be micro-VMs where we want to emulate as few
devices and as little infrastructure as possible.
So maybe looking into something paravirtualized that is more lightweight
might make sense. Maybe not.
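For reference, a minimal emulated type 3 topology with what is already in
upstream QEMU looks roughly like the below (property names from memory and
the DCD/tag pieces are still out of tree, so very much a sketch):
  qemu-system-x86_64 -machine q35,cxl=on \
      -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxl-mem0.raw,size=4G \
      -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/cxl-lsa0.raw,size=256M \
      -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=12 \
      -device cxl-rp,id=rp0,bus=cxl.1,chassis=0,port=0,slot=0 \
      -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G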
[...]
> Migration
> ---------
>
> VM migration will either have to remove all extents, or appropriately
> prepopulate them prior to migration. There are possible ways this
> may be done with the same memory pool contents via 'temporal' sharing,
> but in general this may bring additional complexity.
>
> Kexec etc etc will be similar to how we handle it on the host - probably
> just give all the capacity back.
kdump?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-09-19 9:09 ` David Hildenbrand
@ 2024-09-20 9:06 ` Gregory Price
2024-10-22 9:33 ` David Hildenbrand
0 siblings, 1 reply; 14+ messages in thread
From: Gregory Price @ 2024-09-20 9:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jonathan Cameron, linux-mm, linux-cxl, Davidlohr Bueso, Ira Weiny,
John Groves, virtualization, Oscar Salvador, qemu-devel,
Dave Jiang, Dan Williams, linuxarm, wangkefeng.wang, John Groves,
Fan Ni, Navneet Singh, “Michael S. Tsirkin”,
Igor Mammedov, Philippe Mathieu-Daudé
> > 2. Coarse grained memory increases for 'normal' memory.
> > Can use memory hot-plug. Recovery of capacity likely to only be possible on
> > VM shutdown.
>
> Is there are reason "movable" (ZONE_MOVABLE) is not an option, at least in
> some setups? If not, why?
>
This seems like a bit of a muddied conversation.
"'normal' memory" has no defined meaning - so lets clear this up a bit
There is:
* System-RAM (memory managed by kernel allocators)
* Special Purpose Memory (generally presented as DAX)
System-RAM is managed as zones - the relevant ones are
* ZONE_NORMAL allows both movable and non-movable allocations
* ZONE_MOVABLE only allows movable allocations
(Caveat: this generally only applies to allocation, you can
violate this with stuff like pinning)
Hotplug can be thought of as two discrete mechanisms
* Exposing capacity to the kernel (CXL DCD Transactions)
* Exposing capacity to allocators (mm/memory_hotplug.c)
1) if the intent is to primarily utilize dynamic capacity for VMs, then
the host does not need (read: should not need) to map the memory as
System-RAM in the host. The VMM should be made to consume it directly
via DAX or otherwise.
That capacity is almost by definition "Capital G Guaranteed" to be
reclaimable regardless of what the guest does. A VMM can force a guest
to let go of resources - that's its job.
2) if the intent is to provide dynamic capacity to a host as System-RAM, then
recoverability is dictated by system usage of that capacity. If onlined
into ZONE_MOVABLE, then if the system has avoided doing things like pinning
those pages it should *generally* be recoverable (but not guaranteed).
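For concreteness, assuming the tagged capacity surfaced as dax2.0 on the
host, the two cases might look roughly like this (option names from memory,
sketch only):
  # Case 1): hand the devdax straight to the VMM rather than to the host
  # allocators. The guest-facing device model (nvdimm here) is exactly
  # the open question of this thread.
  qemu-system-x86_64 -machine q35,nvdimm=on -m 4G,slots=2,maxmem=260G \
      -object memory-backend-file,id=tagmem0,share=on,align=2M,mem-path=/dev/dax2.0,size=256G \
      -device nvdimm,id=nv0,memdev=tagmem0   # ... rest of the VM config
  # Case 2): give the capacity to the host allocators, onlined movable so
  # it can (usually) be reclaimed later:
  daxctl reconfigure-device --mode=system-ram dax2.0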
For the virtualization discussion:
Hotplug and recoverability is a non-issue. The capacity should never be
exposed to system allocators and the VMM should be made to consume special
purpose memory directly. That's on the VMM/orchestration software to get right.
For the host System-RAM discussion:
Auto-onlined hotplug capacity presently defaults to ZONE_NORMAL, but we
discussed (yesterday, at Plumbers) changing this default to ZONE_MOVABLE.
The only concern is when insufficient ZONE_NORMAL exists to support
ZONE_MOVABLE capacity - but this is unlikely to be the general scenario AND
can be mitigated w/ existing mechanisms.
Manually onlined capacity defaults to ZONE_MOVABLE.
It would be nice to make this behavior consistent, since the general opinion
appears to be that this capacity should default to ZONE_MOVABLE.
~Gregory
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-09-20 9:06 ` Gregory Price
@ 2024-10-22 9:33 ` David Hildenbrand
2024-10-22 14:24 ` Gregory Price
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2024-10-22 9:33 UTC (permalink / raw)
To: Gregory Price
Cc: Jonathan Cameron, linux-mm, linux-cxl, Davidlohr Bueso, Ira Weiny,
John Groves, virtualization, Oscar Salvador, qemu-devel,
Dave Jiang, Dan Williams, linuxarm, wangkefeng.wang, John Groves,
Fan Ni, Navneet Singh, “Michael S. Tsirkin”,
Igor Mammedov, Philippe Mathieu-Daudé
On 20.09.24 11:06, Gregory Price wrote:
>>> 2. Coarse grained memory increases for 'normal' memory.
>>> Can use memory hot-plug. Recovery of capacity likely to only be possible on
>>> VM shutdown.
>>
>> Is there are reason "movable" (ZONE_MOVABLE) is not an option, at least in
>> some setups? If not, why?
>>
>
>
> This seems like a bit of a muddied conversation.
Cleaning up my inbox ... well at least trying :)
>
> "'normal' memory" has no defined meaning - so lets clear this up a bit
>
> There is:
> * System-RAM (memory managed by kernel allocators)
> * Special Purpose Memory (generally presented as DAX)
>
> System-RAM is managed as zones - the relevant ones are
> * ZONE_NORMAL allows both movable and non-movable allocations
.. except in corner cases like MIGRATE_CMA :)
> * ZONE_MOVABLE only allows movable allocations
> (Caveat: this generally only applies to allocation, you can
> violate this with stuff like pinning)
Note that long-term pinning is forbidden on MOVABLE, just like it is on
MIGRATE_CMA. So we try to ensure that common use cases cannot violate this.
>
> Hotplug can be thought of as two discrete mechanisms
> * Exposing capacity to the kernel (CXL DCD Transactions)
> * Exposing capacity to allocators (mm/memory_hotplug.c)
>
> 1) if the intent is to primarily utilize dynamic capacity for VMs, then
> the host does not need (read: should not need) to map the memory as
> System-RAM in the host. The VMM should be made to consume it directly
> via DAX or otherwise.
>
> That capacity is almost by definition "Capital G Guaranteed" to be
> reclaimable regardless of what the guest does. A VMM can force a guest
> to let go of resources - that's its job.
>
> 2) if the intent is to provide dynamic capacity to a host as System-RAM, then
> recoverability is dictated by system usage of that capacity. If onlined
> into ZONE_MOVABLE, then if the system has avoided doing things like pinning
> those pages it should *generally* be recoverable (but not guaranteed).
There is, of course, the use case of memory overcommit -- in which case
you would want 2). But likely that's out of the picture for this tagged
memory.
>
>
> For the virtualization discussion:
>
> Hotplug and recoverability is a non-issue. The capacity should never be
> exposed to system allocators and the VMM should be made to consume special
> purpose memory directly. That's on the VMM/orchestration software to get right.
>
>
> For the host System-RAM discussion:
>
> Auto-onlined hotplug capacity presently defaults to ZONE_NORMAL, but we
> discussed (yesterday, at Plumbers) changing this default to ZONE_MOVABLE.
>
> The only concern is when insufficient ZONE_NORMAL exists to support
> ZONE_MOVABLE capacity - but this is unlikely to be the general scenario AND
> can be mitigated w/ existing mechanisms.
It might be worthwhile looking at
Documentation/admin-guide/mm/memory-hotplug.rst "auto-movable" memory
onlining policy. It might not fit all use cases, though (just like
ZONE_MOVABLE doesn't).
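For anyone wanting to try it, the knobs are roughly (names from memory,
double-check against that document):
  # auto-online newly added blocks and let the policy pick the zone
  echo online > /sys/devices/system/memory/auto_online_blocks
  echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
  # optionally allow more MOVABLE relative to kernel-usable memory (percent)
  echo 401 > /sys/module/memory_hotplug/parameters/auto_movable_ratio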
>
> Manually onlined capacity defaults to ZONE_MOVABLE.
>
> It would be nice to make this behavior consistent, since the general opinion
> appears to be that this capacity should default to ZONE_MOVABLE.
It's much easier to shoot yourself in the foot with ZONE_MOVABLE,
that's why the default can be adjusted manually using "online_movable"
with e.g., memhp_default_state.
It's all a bit complicated, because there are various use cases and
mechanisms for memory hotplug ... IIRC RHEL defaults with its udev rules
to "ZONE_MOVABLE" on bare metal and "ZONE_NORMAL" in VMs. Except on
s390, where we default to "offline" (standby memory ....).
I once worked on a systemd unit to make this configuration easier (and
avoid udev rules), and possibly more "automatic" depending on the
detected environment.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-10-22 9:33 ` David Hildenbrand
@ 2024-10-22 14:24 ` Gregory Price
2024-10-22 14:35 ` David Hildenbrand
0 siblings, 1 reply; 14+ messages in thread
From: Gregory Price @ 2024-10-22 14:24 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jonathan Cameron, linux-mm, linux-cxl, Davidlohr Bueso, Ira Weiny,
John Groves, virtualization, Oscar Salvador, qemu-devel,
Dave Jiang, Dan Williams, linuxarm, wangkefeng.wang, John Groves,
Fan Ni, Navneet Singh, “Michael S. Tsirkin”,
Igor Mammedov, Philippe Mathieu-Daudé
On Tue, Oct 22, 2024 at 11:33:07AM +0200, David Hildenbrand wrote:
> On 20.09.24 11:06, Gregory Price wrote:
>
> > The only concern is when insufficient ZONE_NORMAL exists to support
> > ZONE_MOVABLE capacity - but this is unlikely to be the general scenario AND
> > can be mitigated w/ existing mechanisms.
>
> It might be worthwhile looking at
> Documentation/admin-guide/mm/memory-hotplug.rst "auto-movable" memory
> onlining policy. It might not fit all use cases, though (just like
> ZONE_MOVABLE doesn't).
>
I managed to miss auto-movable in my last pass through there. Though for
our use-case, forcibly preventing ZONE_NORMAL for all CXL is the preferred
option in an effort to keep as many kernel resources as possible out of
high latency memory.
So I think we're just going to end up using memhp_default_state, and that'll
be mostly fine.
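i.e. something along the lines of:
  # kernel command line: default hotplugged memory blocks to ZONE_MOVABLE
  memhp_default_state=online_movable
  # or per memory block, from a udev rule / the orchestrator:
  echo online_movable > /sys/devices/system/memory/memoryN/state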
> >
> > Manually onlined capacity defaults to ZONE_MOVABLE.
> >
> > It would be nice to make this behavior consistent, since the general opinion
> > appears to be that this capacity should default to ZONE_MOVABLE.
>
> It's much easier to shoot yourself in the foot with ZONE_MOVABLE, that's
> why the default can be adjusted manually using "online_movable" with e.g.,
> memhp_default_state.
>
> It's all a bit complicated, because there are various use cases and
> mechanisms for memory hotplug ... IIRC RHEL defaults with its udev rules to
> "ZONE_MOVABLE" on bare metal and "ZONE_NORMAL" in VMs. Except on s390, where
> we default to "offline" (standby memory ....).
>
> I once worked on a systemd unit to make this configuration easier (and avoid
> udev rules), and possibly more "automatic" depending on the detected
> environment.
>
Appreciate the additional context, thanks!
~Gregory
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
2024-10-22 14:24 ` Gregory Price
@ 2024-10-22 14:35 ` David Hildenbrand
0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2024-10-22 14:35 UTC (permalink / raw)
To: Gregory Price
Cc: Jonathan Cameron, linux-mm, linux-cxl, Davidlohr Bueso, Ira Weiny,
John Groves, virtualization, Oscar Salvador, qemu-devel,
Dave Jiang, Dan Williams, linuxarm, wangkefeng.wang, John Groves,
Fan Ni, Navneet Singh, “Michael S. Tsirkin”,
Igor Mammedov, Philippe Mathieu-Daudé
On 22.10.24 16:24, Gregory Price wrote:
> On Tue, Oct 22, 2024 at 11:33:07AM +0200, David Hildenbrand wrote:
>> On 20.09.24 11:06, Gregory Price wrote:
>>
>>> The only concern is when insufficient ZONE_NORMAL exists to support
>>> ZONE_MOVABLE capacity - but this is unlikely to be the general scenario AND
>>> can be mitigated w/ existing mechanisms.
>>
>> It might be worthwhile looking at
>> Documentation/admin-guide/mm/memory-hotplug.rst "auto-movable" memory
> > onlining policy. It might not fit all use cases, though (just like
> > ZONE_MOVABLE doesn't).
>>
>
> I managed to miss auto-movable in my last pass through there. Though for
> our use-case, forcibly preventing ZONE_NORMAL for all CXL is the preferred
> option in an effort to keep as many kernel resources as possible out of
> high latency memory.
Yes, that makes sense for this memory with very different performance
characteristics.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
2024-08-16 7:05 ` Hannes Reinecke
2024-08-16 9:41 ` Jonathan Cameron via
2024-08-19 2:12 ` John Groves
2024-08-19 15:40 ` Jonathan Cameron via
2024-09-17 19:37 ` Jonathan Cameron via
2024-10-22 14:11 ` Gregory Price
2024-09-17 19:56 ` Jonathan Cameron via
2024-09-18 12:12 ` Jonathan Cameron
2024-09-19 9:09 ` David Hildenbrand
2024-09-20 9:06 ` Gregory Price
2024-10-22 9:33 ` David Hildenbrand
2024-10-22 14:24 ` Gregory Price
2024-10-22 14:35 ` David Hildenbrand