From: David Hildenbrand <david@redhat.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
	linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	Davidlohr Bueso <dave@stgolabs.net>,
	Ira Weiny <ira.weiny@intel.com>, John Groves <John@Groves.net>,
	virtualization@lists.linux.dev
Cc: "Oscar Salvador" <osalvador@suse.de>,
	qemu-devel@nongnu.org, "Dave Jiang" <dave.jiang@intel.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	linuxarm@huawei.com, wangkefeng.wang@huawei.com,
	"John Groves" <jgroves@micron.com>, "Fan Ni" <fan.ni@samsung.com>,
	"Navneet Singh" <navneet.singh@intel.com>,
	"“Michael S. Tsirkin”" <mst@redhat.com>,
	"Igor Mammedov" <imammedo@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>
Subject: Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
Date: Thu, 19 Sep 2024 11:09:40 +0200
Message-ID: <fc05d089-ce04-42d2-a0d7-ea32fd73fe90@redhat.com>
In-Reply-To: <20240815172223.00001ca7@Huawei.com>

Sorry for the late reply ...

> Later on, some magic entity - let's call it an orchestrator, will tell the

In the virtio-mem world, that's usually something (admin/tool/whatever) 
in the hypervisor. What does it look like with CXL on bare metal?

> memory pool to provide memory to a given host. The host gets notified by
> the device of an 'offer' of specific memory extents and can accept it, after
> which it may start to make use of the provided memory extent.
> 
> Those address ranges may be shared across multiple hosts (in which case they
> are not for general use), or may be dedicated memory intended for use as
> normal RAM.
> 
> Whilst the granularity of DCD extents is allowed by the specification to be very
> fine (64 Bytes), in reality my expectation is no one will build general purpose
> memory pool devices with fine granularity.
> 
> Memory hot-plug options (bare metal)
> ------------------------------------
> 
> By default, these extents will surface as either:
> 1) Normal memory hot-plugged into a NUMA node.
> 2) DAX - requiring applications to map that memory directly or use
>     a filesystem etc.
> 
> There are various ways to apply policy to this. One is to base the policy
> decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
> is metadata that originates at the orchestrator. It's big enough to hold a
> UUID, so can convey whatever meaning is agreed by the orchestrator and the
> software running on each host.
> 
> Memory pools tend to want to guarantee, when the circumstances change
> (workload finishes etc), they can have the resources they allocated back.

Of course they want that guarantee. *insert usual unicorn example*

We can usually try hard, but "guarantee" is really a strong requirement 
that I am afraid we won't be able to give in many scenarios.

I'm sure CXL people were aware this is one of the basic issues of memory 
hotunplug (at least I kept telling them). If not, they didn't do their 
research properly or tried to ignore it.

> CXL brings polite ways of asking for the memory back and big hammers for
> when the host ignores things (which may well crash a naughty host).
> Reliable hot unplug of normal memory continues to be a challenge for memory
> that is 'normal' because not all its use / lifetime is tied to a particular
> application.

Yes. And crashing is worse than anything else. Rather shut down/reboot 
the offending machine in a somewhat nice way instead of crashing it.

> 
> Application specific memory
> ---------------------------
> 
> The DAX path enables association of the memory with a single application
> by allowing that application to simply mmap the appropriate /dev/daxX.Y.
> That device optionally has an associated tag.
> 
> When the application closes or otherwise releases that memory we can
> guarantee to be able to recover the capacity.  Memory provided to an
> application this way will be referred to here as Application Specific Memory.
> This model also works for HBM or other 'better' memory that is reserved for
> specific use cases.
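
For reference, the application side of that is plain device DAX today; a 
minimal sketch (device name and size are made up, real code would query 
the size, e.g., via libdaxctl):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1UL << 30;   /* assume a 1 GiB dax device */
    int fd = open("/dev/dax0.0", O_RDWR);
    void *p;

    if (fd < 0) {
        perror("open /dev/dax0.0");
        return EXIT_FAILURE;
    }

    /* device DAX only supports shared mappings of the device memory */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    /* ... application manages this memory itself ... */

    munmap(p, len);
    close(fd);   /* capacity can now be reclaimed by the pool */
    return EXIT_SUCCESS;
}

Once the mapping is gone and the device is released, handing the capacity 
back is indeed unproblematic.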
> 
> So the flow is something like:
> 1. Cloud orchestrator decides it's going to run in-memory database A
>     on host W.
> 2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
>     UUID / tag wwwwxxxxzzzz
> 3. Host W accepts that memory (why would it say no?) and creates a
>     DAX device for which the tag is discoverable.

Maybe there could be limitations (maximum addressable PFN?) where we 
would have to reject it? Not sure.

> 4. Orchestrator tells host W to launch the workload and that it
>     should use the memory provided with tag wwwwxxxxzzzz.
> 5. Host launches DB and tells it to use DAX device with tag wwwwxxxxzzzz
>     which the DB then mmap()s and loads its database data into.
> ... sometime later....
> 6. Orchestrator tells host W to close that DB and release the memory
>     allocated from the pool.
> 7. Host gives the memory back to the memory appliance which can then use
>     it to provide another host with the necessary memory.
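
Regarding step 5: the lookup by tag is the only really new bit on the 
host side; I'd imagine something like the sketch below. Note that the 
"tag" sysfs attribute is purely an assumption on my part - how the kernel 
will expose DCD tags for dax devices isn't settled yet:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the dax device whose (hypothetical) "tag" attribute matches the
 * UUID the orchestrator handed down. The attribute name is made up; the
 * kernel interface for exposing DCD tags is not settled.
 */
static int find_dax_by_tag(const char *wanted, char *dev, size_t devlen)
{
    DIR *d = opendir("/sys/bus/dax/devices");
    struct dirent *de;

    if (!d)
        return -1;

    while ((de = readdir(d)) != NULL) {
        char path[512], tag[128] = "";
        FILE *f;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/sys/bus/dax/devices/%s/tag",
                 de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(tag, sizeof(tag), f))
            tag[strcspn(tag, "\n")] = '\0';
        fclose(f);

        if (!strcmp(tag, wanted)) {
            snprintf(dev, devlen, "/dev/%s", de->d_name);
            closedir(d);
            return 0;
        }
    }
    closedir(d);
    return -1;
}

The rest (open + mmap of the returned device) is then the same as for any 
other device DAX user.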
> 
> This approach requires applications or at least memory allocation libraries to
> be modified.  The guarantees of getting the memory they asked for + that they
> will definitely be able to safely give the memory back when done, may make such
> software modifications worthwhile.
> 
> There are disadvantages and bloat issues if the 'wrong' amount of memory is
> allocated to the application. So these techniques only work when the
> orchestrator has the necessary information about the workload.

Yes.

> 
> Note that one specific example of application specific memory is virtual
> machines.  Just in this case the virtual machine is the application.
> Later on it may be useful to consider the example of the specific
> application in a VM being a nested virtual machine.
> 
> Shared Memory - closely related!
> --------------------------------
> 
> CXL enables a number of different types of memory sharing across multiple
> hosts:
> - Read only shared memory (suitable for apache arrow for example)
> - Hardware Coherent shared memory.
> - Software managed coherency.

Do we have any timeline for when we will see real shared-memory devices? 
z/VM has supported shared segments between VMs for a couple of decades.

> 
> These surface using the same machinery as non-shared DCD extents. Note however
> that the presentation, in terms of extents, to different hosts is not the same
> (can be different extents, in an unrelated order) but along with tags, shared
> extents have sufficient data to 'construct' a virtual address to HPA mapping
> that makes them look the same to aware applications or file systems.  The
> currently proposed approach is to surface the extents via DAX and apply a
> filesystem approach to managing the data.
> https://lpc.events/event/18/contributions/1827/
> 
> These two types of memory pooling activity (shared memory, application specific
> memory) both require capacity associated with a tag to be presented to specific
> users in a fashion that is 'separate' from normal memory hot-plug.
> 
> The virtualization question
> ===========================
> 
> Having made the assumption that the models above are going to be used in
> practice, and that Linux will support them, the natural next step is to
> assume that applications designed against them are going to be used in virtual
> machines as well as on bare metal hosts.
> 
> The open question this RFC is aiming to start discussion around is how best to
> present them to the VM.  I want to get that discussion going early because
> some of the options I can see will require specification additions and / or
> significant PoC / development work to prove them out.  Before we go there,
> let us briefly consider other uses of pooled memory in VMs and how they
> aren't really relevant here.
> 
> Other virtualization uses of memory pool capacity
> -------------------------------------------------
> 
> 1. Part of static capacity of VM provided from a memory pool.
>     Can be presented as a NUMA setup, with HMAT etc providing performance data
>     relative to other memory the VM is using. Recovery of pooled capacity
>     requires shutting down or migrating the VM.
> 2. Coarse grained memory increases for 'normal' memory.
>     Can use memory hot-plug. Recovery of capacity likely to only be possible on
>     VM shutdown.

Is there a reason "movable" (ZONE_MOVABLE) is not an option, at least 
in some setups? If it is not, why not?
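
To illustrate what I mean: onlining the hot-plugged blocks into 
ZONE_MOVABLE keeps them hot-unpluggable. Rough sketch of the user-space 
side (the block number is made up; a udev rule / daemon reacting to the 
add uevent, or memhp_default_state=online_movable, would be the usual 
way):

#include <stdio.h>

int main(void)
{
    /* block number is made up; normally udev reacts to the add uevent */
    const char *state = "/sys/devices/system/memory/memory512/state";
    FILE *f = fopen(state, "w");

    if (!f) {
        perror(state);
        return 1;
    }
    /* online into ZONE_MOVABLE so the block stays hot-unpluggable */
    if (fputs("online_movable", f) == EOF)
        perror("write online_movable");
    fclose(f);
    return 0;
}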

> 
> Both these use cases are well covered by existing solutions so we can ignore
> them for the rest of this document.
> 
> Application specific or shared dynamic capacity - VM options.
> -------------------------------------------------------------
> 
> 1. Memory hot-plug - but with specific purpose memory flag set in EFI
>     memory map.  Current default policy is to bring those up as normal memory.
>     That policy can be adjusted via kernel option or Kconfig so they turn up
>     as DAX.  We 'could' augment the metadata of such hot-plugged memory
>     with the UID / tag from an underlying bare metal DAX device.
> 
> 2. Virtio-mem - It may be possible to fit this use case within an extended
>     virtio-mem.
> 
> 3. Emulate a CXL type 3 device.
> 
> 4. Other options?
> 
> Memory hotplug
> --------------
> 
> This is the heavyweight solution but should 'work' if we close a specification
> gap.  Granularity limitations are unlikely to be a big problem given anticipated
> CXL devices.
> 
> Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific Purpose
> Memory" intended to notify the operating system that it can use the memory as
> normal, but it is there for a specific use case and so might be wanted back at
> any point. This memory attribute can be provided in the memory map at boot
> time and if associated with EfiReservedMemoryType can be used to indicate a
> range of HPA Space where memory that is hot-plugged later should be treated as
> 'special'.
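
Side note: with CONFIG_EFI_SOFT_RESERVE such ranges should show up as 
"Soft Reserved" in the resource tree, so a quick way to see what firmware 
handed over is something like the following (sketch; needs root, 
otherwise the addresses read as zero):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/iomem", "r");

    if (!f) {
        perror("/proc/iomem");
        return 1;
    }
    /* EFI_MEMORY_SP ranges are registered as "Soft Reserved" resources */
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "Soft Reserved"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}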
> 
> There isn't an obvious path to associate a particular range of hot plugged
> memory with a UID / tag.  I'd expect we'd need to add something to the ACPI
> specification to enable this.
> 
> Virtio-mem
> ----------
> 
> The design goals of virtio-mem [1] mean that it is not 'directly' applicable
> to this case, but could perhaps be adapted with the addition of metadata
> and DAX + guaranteed removal of explicit extents.

Maybe it could be extended, or one could build something similar that is 
better tailored to the "shared memory" use case.

> 
> [1] [virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
> Martin Schulz, VEE '21]
> 
> Emulating a CXL Type 3 Device
> -----------------------------
> 
> Concerns raised about just emulating a CXL topology:
> * A CXL Type 3 device is pretty complex.
> * All we need is a tag + make it DAX, so surely this is too much?
> 
> Possible advantages
> * Kernel is exactly the same as that running on the host. No new drivers or
>    changes to existing drivers needed, as what we are presenting is a possible
>    device topology - which may be much simpler than the host's.
> 
> Complexity:
> ***********
> 
> We don't emulate everything that can exist in physical topologies.
> - One emulated device per host CXL Fixed Memory Window
>    (I think we can't quite get away with just one in total due to BW/Latency
>     discovery)
> - Direct connect each emulated device to an emulated RP + Host Bridge.
> - Single CXL Fixed Memory Window.  Never present interleave (that's a host
>    only problem).
> - Can probably always present a single extent per DAX region (if we don't
>    mind burning some GPA space to avoid fragmentation).

For "ordinary" hotplug virtio-mem provides real benefits over DIMMs. One 
thing to consider might be micro-vms where we want to emulate as little 
devices+infrastructure as possible.

So maybe looking into something paravirtualized that is more lightweight 
might make sense. Maybe not.

[...]

> Migration
> ---------
> 
> VM migration will either have to remove all extents, or appropriately
> prepopulate them prior to migration.  There are possible ways this
> may be done with the same memory pool contents via 'temporal' sharing,
> but in general this may bring additional complexity.
> 
> Kexec etc etc will be similar to how we handle it on the host - probably
> just give all the capacity back.

kdump?

-- 
Cheers,

David / dhildenb



Thread overview: 14+ messages
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
2024-08-16  7:05 ` Hannes Reinecke
2024-08-16  9:41   ` Jonathan Cameron via
2024-08-19  2:12 ` John Groves
2024-08-19 15:40   ` Jonathan Cameron via
2024-09-17 19:37     ` Jonathan Cameron via
2024-10-22 14:11       ` Gregory Price
2024-09-17 19:56     ` Jonathan Cameron via
2024-09-18 12:12       ` Jonathan Cameron
2024-09-19  9:09 ` David Hildenbrand [this message]
2024-09-20  9:06   ` Gregory Price
2024-10-22  9:33     ` David Hildenbrand
2024-10-22 14:24       ` Gregory Price
2024-10-22 14:35         ` David Hildenbrand
