From: David Hildenbrand <david@redhat.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
linux-mm@kvack.org, linux-cxl@vger.kernel.org,
Davidlohr Bueso <dave@stgolabs.net>,
Ira Weiny <ira.weiny@intel.com>, John Groves <John@Groves.net>,
virtualization@lists.linux.dev
Cc: "Oscar Salvador" <osalvador@suse.de>,
qemu-devel@nongnu.org, "Dave Jiang" <dave.jiang@intel.com>,
"Dan Williams" <dan.j.williams@intel.com>,
linuxarm@huawei.com, wangkefeng.wang@huawei.com,
"John Groves" <jgroves@micron.com>, "Fan Ni" <fan.ni@samsung.com>,
"Navneet Singh" <navneet.singh@intel.com>,
"“Michael S. Tsirkin”" <mst@redhat.com>,
"Igor Mammedov" <imammedo@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>
Subject: Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
Date: Thu, 19 Sep 2024 11:09:40 +0200 [thread overview]
Message-ID: <fc05d089-ce04-42d2-a0d7-ea32fd73fe90@redhat.com> (raw)
In-Reply-To: <20240815172223.00001ca7@Huawei.com>
Sorry for the late reply ...
> Later on, some magic entity - let's call it an orchestrator, will tell the
In the virtio-mem world, that's usually something (admin/tool/whatever)
in the hypervisor. What does it look like with CXL on bare metal?
> memory pool to provide memory to a given host. The host gets notified by
> the device of an 'offer' of specific memory extents and can accept it, after
> which it may start to make use of the provided memory extent.
>
> Those address ranges may be shared across multiple hosts (in which case they
> are not for general use), or may be dedicated memory intended for use as
> normal RAM.
>
> Whilst the granularity of DCD extents is allowed by the specification to be very
> fine (64 Bytes), in reality my expectation is no one will build general purpose
> memory pool devices with fine granularity.
>
> Memory hot-plug options (bare metal)
> ------------------------------------
>
> By default, these extents will surface as either:
> 1) Normal memory hot-plugged into a NUMA node.
> 2) DAX - requiring applications to map that memory directly or use
> a filesystem etc.
>
> There are various ways to apply policy to this. One is to base the policy
> decision on a 'tag' that is associated with a set of DPA extents. That 'tag'
> is metadata that originates at the orchestrator. It's big enough to hold a
> UUID, so can convey whatever meaning is agreed by the orchestrator and the
> software running on each host.
>
> Memory pools tend to want a guarantee that, when the circumstances change
> (workload finishes etc), they can have the resources they allocated back.
Of course they want that guarantee. *insert usual unicorn example*
We can usually try hard, but "guarantee" is really a strong requirement
that I am afraid we won't be able to meet in many scenarios.
I'm sure the CXL people were aware that this is one of the basic issues of
memory hot-unplug (at least I kept telling them). If not, they didn't do
their research properly or chose to ignore it.
> CXL brings polite ways of asking for the memory back and big hammers for
> when the host ignores things (which may well crash a naughty host).
> Reliable hot unplug continues to be a challenge for memory that is 'normal',
> because not all of its use / lifetime is tied to a particular
> application.
Yes. And crashing is worse than anything else. Rather shut down/reboot
the offending machine in a somewhat nice way instead of crashing it.
>
> Application specific memory
> ---------------------------
>
> The DAX path enables association of the memory with a single application
> by allowing that application to simply mmap appropriate /dev/daxX.Y
> That device optionally has an associated tag.
>
> When the application closes or otherwise releases that memory we can
> guarantee to be able to recover the capacity. Memory provided to an
> application this way will be referred to here as Application Specific Memory.
> This model also works for HBM or other 'better' memory that is reserved for
> specific use cases.
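Agreed that this is the attractive part of the model: from the application's
point of view it is just an open() + mmap() of the device DAX node. A minimal
sketch of the consumer side, assuming the accepted capacity surfaced as
/dev/dax0.0 and that we map 1 GiB (device DAX only allows MAP_SHARED mappings
and expects length/offset to respect the dax region alignment):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Device name is an assumption; in practice it comes from tag lookup. */
        const char *path = "/dev/dax0.0";
        const size_t len = 1UL << 30;   /* must match the region's alignment */
        void *p;
        int fd;

        fd = open(path, O_RDWR);
        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }

        /* Device DAX only supports shared mappings. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* ... the application places its data structures in [p, p + len) ... */

        munmap(p, len);
        close(fd);
        return EXIT_SUCCESS;
}

And it is exactly because the lifetime of the mapping is bound to the
application that handing the capacity back afterwards is straightforward.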
>
> So the flow is something like:
> 1. Cloud orchestrator decides it's going to run in memory database A
> on host W.
> 2. Memory appliance Z is told to 'offer' 1TB of memory to host W with
> UUID / tag wwwwxxxxzzzz
> 3. Host W accepts that memory (why would it say no?) and creates a
> DAX device for which the tag is discoverable.
Maybe there could be limitations (maximum addressable PFN?) where we
would have to reject it? Not sure.
> 4. Orchestrator tells host W to launch the workload and that it
> should use the memory provided with tag wwwwxxxxzzzz.
> 5. Host launches DB and tells it to use DAX device with tag wwwwxxxxzzzz
> which the DB then mmap()s and loads its database data into.
> ... sometime later....
> 6. Orchestrator tells host W to close that DB and release the memory
> allocated from the pool.
> 7. Host gives the memory back to the memory appliance which can then use
> it to provide another host with the necessary memory.
>
> This approach requires applications or at least memory allocation libraries to
> be modified. The guarantees of getting the memory they asked for + that they
> will definitely be able to safely give the memory back when done, may make such
> software modifications worthwhile.
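One interface detail I'd like to see nailed down is how the tag actually
becomes discoverable from user space. As a strawman only - the "tag" sysfs
attribute below is made up, no such ABI exists today - the lookup a memory
allocation library would have to do could be roughly:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the dax device whose (hypothetical) sysfs "tag" attribute matches the
 * UUID handed to us by the orchestrator. Fills dev_path on success.
 */
int find_dax_by_tag(const char *wanted_tag, char *dev_path, size_t len)
{
        const char *base = "/sys/bus/dax/devices";
        struct dirent *de;
        DIR *dir = opendir(base);

        if (!dir)
                return -1;

        while ((de = readdir(dir))) {
                char attr[256], tag[64] = "";
                FILE *f;

                if (de->d_name[0] == '.')
                        continue;

                /* "tag" is a placeholder attribute name, not a real ABI. */
                snprintf(attr, sizeof(attr), "%s/%s/tag", base, de->d_name);
                f = fopen(attr, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%63s", tag) == 1 && !strcmp(tag, wanted_tag)) {
                        snprintf(dev_path, len, "/dev/%s", de->d_name);
                        fclose(f);
                        closedir(dir);
                        return 0;
                }
                fclose(f);
        }
        closedir(dir);
        return -1;
}

The device found this way would then be mmap()ed just like in the sketch
further up.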
>
> There are disadvantages and bloat issues if the 'wrong' amount of memory is
> allocated to the application. So these techniques only work when the
> orchestrator has the necessary information about the workload.
Yes.
>
> Note that one specific example of application specific memory is virtual
> machines; it is just that in this case the virtual machine is the application.
> Later on it may be useful to consider the example of the specific
> application in a VM being a nested virtual machine.
>
> Shared Memory - closely related!
> --------------------------------
>
> CXL enables a number of different types of memory sharing across multiple
> hosts:
> - Read only shared memory (suitable for apache arrow for example)
> - Hardware Coherent shared memory.
> - Software managed coherency.
Do we have any timeline for when we will see real shared-memory devices? z/VM
has supported shared segments between VMs for a couple of decades.
>
> These surface using the same machinery as non-shared DCD extents. Note however
> that the presentation, in terms of extents, to different hosts is not the same
> (they can be different extents, in an unrelated order), but along with tags,
> shared extents carry sufficient data to 'construct' a virtual address to HPA
> mapping that makes them look the same to aware applications or file systems.
> The currently proposed approach is to surface the extents via DAX and apply a
> filesystem approach to managing the data.
> https://lpc.events/event/18/contributions/1827/
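Right - and the nice property of the filesystem approach is that the per-host
extent layout is completely hidden behind a file name. For the read-only
(apache arrow style) sharing case the consumer could be as simple as the
following sketch; the mount point and file name are assumptions agreed on via
the orchestrator, and I'm assuming a famfs-like filesystem backed by the
shared DAX region:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical mount point and file name. */
        const char *path = "/mnt/shared/table.arrow";
        struct stat st;
        void *p;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
                perror(path);
                return 1;
        }

        /*
         * Every host maps the same file read-only; the filesystem hides the
         * fact that the underlying extents differ from host to host.
         */
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... consume the shared data in place ... */

        munmap(p, st.st_size);
        close(fd);
        return 0;
}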
>
> These two types of memory pooling activity (shared memory, application specific
> memory) both require capacity associated with a tag to be presented to specific
> users in a fashion that is 'separate' from normal memory hot-plug.
>
> The virtualization question
> ===========================
>
> Having made the assumption that the models above are going to be used in
> practice, and that Linux will support them, the natural next step is to
> assume that applications designed against them are going to be used in virtual
> machines as well as on bare metal hosts.
>
> The open question this RFC is aiming to start discussion around is how best to
> present them to the VM. I want to get that discussion going early because
> some of the options I can see will require specification additions and / or
> significant PoC / development work to prove them out. Before we go there,
> let us briefly consider other uses of pooled memory in VMs and how they
> aren't really relevant here.
>
> Other virtualization uses of memory pool capacity
> -------------------------------------------------
>
> 1. Part of static capacity of VM provided from a memory pool.
> Can be presented as a NUMA setup, with HMAT etc providing performance data
> relative to other memory the VM is using. Recovery of pooled capacity
> requires shutting down or migrating the VM.
> 2. Coarse grained memory increases for 'normal' memory.
> Can use memory hot-plug. Recovery of capacity likely to only be possible on
> VM shutdown.
Is there a reason "movable" (ZONE_MOVABLE) is not an option, at least
in some setups? If there isn't, why not use it?
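To make that concrete: onlining hotplugged blocks to ZONE_MOVABLE is just a
matter of writing "online_movable" to the memory block's state file (or
setting /sys/devices/system/memory/auto_online_blocks accordingly). Rough
sketch, with a made-up block number:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* memory42 is a made-up block; in practice udev reacts to new blocks. */
        const char *state = "/sys/devices/system/memory/memory42/state";
        const char *mode = "online_movable";
        int fd = open(state, O_WRONLY);

        if (fd < 0 || write(fd, mode, strlen(mode)) < 0) {
                perror(state);
                return 1;
        }
        close(fd);
        return 0;
}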
>
> Both these use cases are well covered by existing solutions so we can ignore
> them for the rest of this document.
>
> Application specific or shared dynamic capacity - VM options.
> -------------------------------------------------------------
>
> 1. Memory hot-plug - but with specific purpose memory flag set in EFI
> memory map. Current default policy is to bring those up as normal memory.
> That policy can be adjusted via kernel option or Kconfig so they turn up
> as DAX. We 'could' augment the metadata of such hot-plugged memory
> with the UID / tag from an underlying bare metal DAX device.
>
> 2. Virtio-mem - It may be possible to fit this use case within an extended
> virtio-mem.
>
> 3. Emulate a CXL type 3 device.
>
> 4. Other options?
>
> Memory hotplug
> --------------
>
> This is the heavyweight solution but should 'work' if we close a specification
> gap. Granularity limitations are unlikely to be a big problem given anticipated
> CXL devices.
>
> Today, the EFI memory map has an attribute EFI_MEMORY_SP, for "Specific Purpose
> Memory" intended to notify the operating system that it can use the memory as
> normal, but it is there for a specific use case and so might be wanted back at
> any point. This memory attribute can be provided in the memory map at boot
> time and if associated with EfiReservedMemoryType can be used to indicate a
> range of HPA Space where memory that is hot-plugged later should be treated as
> 'special'.
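For reference, the "treat it as special / route it to DAX" variant of that
policy boils down to checking a single attribute bit per descriptor in the
EFI memory map. A simplified sketch (the descriptor structure is trimmed
down and the names are mine; only the attribute check matters here):

#include <stdbool.h>
#include <stdint.h>

#define EFI_MEMORY_SP 0x0000000000040000ULL  /* "specific purpose" attribute */

/* Simplified memory map descriptor; the real one carries more fields. */
struct efi_mem_desc {
        uint32_t type;
        uint64_t phys_addr;
        uint64_t num_pages;
        uint64_t attribute;
};

/*
 * Policy sketch: anything marked EFI_MEMORY_SP is kept away from the normal
 * page allocator and handed to device DAX instead.
 */
bool route_to_dax(const struct efi_mem_desc *d)
{
        return d->attribute & EFI_MEMORY_SP;
}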
>
> There isn't an obvious path to associate a particular range of hot-plugged
> memory with a UID / tag. I'd expect we'd need to add something to the ACPI
> specification to enable this.
>
> Virtio-mem
> ----------
>
> The design goals of virtio-mem [1] mean that it is not 'directly' applicable
> to this case, but could perhaps be adapted with the addition of metadata
> and DAX + guaranteed removal of explicit extents.
Maybe it could be extended, or one could build something similar
that is better tailored to the "shared memory" use case.
>
> [1] [virtio-mem: Paravirtualized Memory Hot(Un)Plug, David Hildenbrand and
> Martin Schulz, Vee'21]
>
> Emulating a CXL Type 3 Device
> -----------------------------
>
> Concerns raised about just emulating a CXL topology:
> * A CXL Type 3 device is pretty complex.
> * All we need is a tag + make it DAX, so surely this is too much?
>
> Possible advantages
> * Kernel is exactly the same as that running on the host. No new drivers or
> changes to existing drivers needed as what we are presenting is a possible
> device topology - which may be much simpler than the host's.
>
> Complexity:
> ***********
>
> We don't emulate everything that can exist in physical topologies.
> - One emulated device per host CXL Fixed Memory Window
> (I think we can't quite get away with just one in total due to BW/Latency
> discovery)
> - Direct connect each emulated device to an emulated RP + Host Bridge.
> - Single CXL Fixed Memory Window. Never present interleave (that's a host
> only problem).
> - Can probably always present a single extent per DAX region (if we don't
> mind burning some GPA space to avoid fragmentation).
For "ordinary" hotplug virtio-mem provides real benefits over DIMMs. One
thing to consider might be micro-vms where we want to emulate as little
devices+infrastructure as possible.
So maybe looking into something paravirtualized that is more lightweight
might make sense. Maybe not.
[...]
> Migration
> ---------
>
> VM migration will either have to remove all extents, or appropriately
> prepopulate them prior to migration. There are possible ways this
> may be done with the same memory pool contents via 'temporal' sharing,
> but in general this may bring additional complexity.
>
> Kexec etc etc will be similar to how we handle it on the host - probably
> just give all the capacity back.
kdump?
--
Cheers,
David / dhildenb