From: Jonathan Cameron <jic23@kernel.org>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: "John Groves" <John@groves.net>,
linuxarm@huawei.com, "David Hildenbrand" <david@redhat.com>,
linux-mm <linux-mm@kvack.org>,
linux-cxl <linux-cxl@vger.kernel.org>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Ira Weiny" <ira.weiny@intel.com>,
virtualization <virtualization@lists.linux.dev>,
"Oscar Salvador" <osalvador@suse.de>,
qemu-devel <qemu-devel@nongnu.org>,
"Dave Jiang" <dave.jiang@intel.com>,
"Dan Williams" <dan.j.williams@intel.com>,
"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>,
"John Groves" <jgroves@micron.com>, "Fan Ni" <fan.ni@samsung.com>,
"Navneet Singh" <navneet.singh@intel.com>,
" “Michael S. Tsirkin”" <mst@redhat.com>,
"Igor Mammedov" <imammedo@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>
Subject: Re: [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared)
Date: Wed, 18 Sep 2024 13:12:32 +0100
Message-ID: <20240918131232.6fa02096@jic23-huawei>
In-Reply-To: <20240917205048.00001e34@huawei.com>
On Tue, 17 Sep 2024 20:56:53 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> On Tue, 17 Sep 2024 19:37:21 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> > Plan is currently to meet at the LPC registration desk at 2pm tomorrow (Wednesday) and we will find a room.
> >
>
> And now the internet maybe knows my phone number (serves me right for using
> my company mobile app that auto-added a signature).
> I might have been lucky and it didn't hit the archives because
> the formatting was too broken..
>
> Anyhow, see some of you tomorrow. I didn't manage to borrow a Jabra mic
> so remote will be tricky but feel free to reach out and we might be
> able to sort something.
>
> Intent is this will be an informal BoF, so we'll figure out the scope
> at the start of the meeting.
>
> Sorry for the noise!
Hack room 1.14 now if anyone is looking for us.
>
> Jonathan
>
> > J
> > On Sun, 18 Aug 2024 21:12:34 -0500
> > John Groves <John@groves.net> wrote:
> >
> > > On 24/08/15 05:22PM, Jonathan Cameron wrote:
> > > > Introduction
> > > > ============
> > > >
> > > > If we think application specific memory (including inter-host shared memory) is
> > > > a thing, it will also be a thing people want to use with virtual machines,
> > > > potentially nested. So how do we present it at the Host to VM boundary?
> > > >
> > > > This RFC is perhaps premature given we haven't yet merged upstream support for
> > > > the bare metal case. However I'd like to get the discussion going given we've
> > > > touched briefly on this in a number of CXL sync calls and it is clear no one is
> > >
> > > Excellent write-up, thanks Jonathan.
> > >
> > > Hannes' idea of an in-person discussion at LPC is a great idea - count me in.
> >
> > Had a feeling you might say that ;)
> >
> > >
> > > As the proprietor of famfs [1] I have many thoughts.
> > >
> > > First, I like the concept of application-specific memory (ASM), but I wonder
> > > if there might be a better term for it. ASM suggests that there is one
> > > application, but I'd suggest that a more concise statement of the concept
> > > is that the Linux kernel never accesses or mutates the memory - even though
> > > multiple apps might share it (e.g. via famfs). It's a subtle point, but
> > > an important one for RAS etc. ASM might better be called non-kernel-managed
> > > memory - though that name does not have as good a ring to it. Will mull this
> > > over further...
> >
> > Naming is always the hard bit :) I agree that one doesn't work for
> > shared capacity. You can tell I didn't start there :)
> >
> > >
> > > Now a few level-setting comments on CXL and Dynamic Capacity Devices (DCDs),
> > > some of which will be obvious to many of you:
> > >
> > > * A DCD is just a memory device with an allocator and host-level
> > > access-control built in.
> > > * Usable memory from a DCD is not available until the fabric manager (likely
> > > on behalf of an orchestrator) performs an Initiate Dynamic Capacity Add
> > > command to the DCD.
> > > * A DCD allocation has a tag (uuid) which is the invariant way of identifying
> > > the memory from that allocation.
> > > * The tag becomes known to the host from the DCD extents provided via
> > > a CXL event following successful allocation.
> > > * The memory associated with a tagged allocation will surface as a dax device
> > > on each host that has access to it. But of course dax device naming &
> > > numbering won't be consistent across separate hosts - so we need to use
> > > the uuids to find specific memory.
> > >
> > > A few less foundational observations:
> > >
> > > * It does not make sense to "online" shared or sharable memory as system-ram,
> > > because system-ram gets zeroed, which blows up use cases for sharable memory.
> > > So the default for sharable memory must be devdax mode.
> > (CXL specific diversion)
> >
> > Absolutely agree with this. There is a 'corner' that irritates me in the spec though,
> > which is that there is no distinction between shareable and shared capacity.
> > If we are in a constrained setup with limited HPA or DPA space, we may not want
> > to have separate DCD regions for these. Thus it is plausible that an orchestrator
> > might tell a memory appliance to present memory for general use and yet it
> > surfaces as shareable. So there may need to be an opt in path at least for
> > going ahead and using this memory as normal RAM.
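> >
> > If we do add such an opt-in, the per-device mechanics largely exist already;
> > roughly (device name made up for illustration):
> >
> >   daxctl reconfigure-device dax0.0 --mode=system-ram  # explicit opt in, onlines via kmem
> >   daxctl reconfigure-device dax0.0 --mode=devdax      # back to device-dax (memory must be offline again)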
> >
> > > * Tags are mandatory for sharable allocations, and allowed but optional for
> > > non-sharable allocations. The implication is that non-sharable allocations
> > > may get onlined automatically as system-ram, so we don't need a namespace
> > > for those. (I argued for mandatory tags on all allocations - hey you don't
> > > have to use them - but encountered objections and dropped it.)
> > > * CXL access control only goes to host root ports; CXL has no concept of
> > > giving access to a VM. So some component on a host (perhaps logically
> > > an orchestrator component) needs to plumb memory to VMs as appropriate.
> >
> > Yes. It's some mashup of an orchestrator and VMM / libvirt, local library
> > of your choice. We can just group it all into the ill-defined concept of
> > a distributed orchestrator.
> >
> > >
> > > So tags are a namespace to find specific memory "allocations" (which in the
> > > CXL consortium, we usually refer to as "tagged capacity").
> > >
> > > In an orchestrated environment, the orchestrator would allocate resources
> > > (including tagged memory capacity), make that capacity visible on the right
> > > host(s), and then provide the tag when starting the app if needed.
> > >
> > > If (e.g.) the memory contains a famfs file system, famfs needs the uuid of the
> > > root memory allocation to find the right memory device. Once mounted, it's a
> > > file system so apps can be directed to the mount path. Apps that consume the
> > > dax devices directly also need the uuid because /dev/dax0.0 is not invariant
> > > across a cluster...
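> > >
> > > To make that concrete, the flow is roughly the following (paths invented
> > > for illustration; see the famfs README for the exact syntax):
> > >
> > >   # resolve tag -> /dev/dax1.0 via whatever lookup mechanism we settle on
> > >   mkfs.famfs /dev/dax1.0              # once, on the host that owns the allocation
> > >   famfs mount /dev/dax1.0 /mnt/famfs  # on each host that shares it
> > >   ls /mnt/famfs                       # apps just open files from here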
> > >
> > > I have been assuming that when the CXL stack discovers a new DCD allocation,
> > > it will configure the devdax device and provide some way to find it by tag.
> > > /sys/cxl/<tag>/dev or whatever. That works as far as it goes, but I'm coming
> > > around to thinking that the uuid-to-dax map should not be overtly CXL-specific.
> >
> > Agreed. Whether that's a nice kernel side thing, or a utility pulling data
> > from various kernel subsystem interfaces doesn't really matter. I'd prefer
> > the kernel presents this but maybe that won't work for some reason.
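> >
> > For the sake of discussion, if something like the /sys/cxl/<tag>/dev idea
> > above (or a non-CXL-specific equivalent) existed, the userspace lookup
> > would be trivial. Entirely hypothetical interface, tag made up:
> >
> >   tag=0f6e2b1c-5a1d-4e7a-9c3b-000000000000
> >   devno=$(cat /sys/cxl/$tag/dev)              # hypothetical, e.g. "252:0"
> >   basename $(readlink /sys/dev/char/$devno)   # -> dax0.0 (standard sysfs)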
> >
> > >
> > > General thoughts regarding VMs and qemu
> > >
> > > Physical connections to CXL memory are handled by physical servers. I don't
> > > think there is a scenario in which a VM should interact directly with the
> > > PCIe function(s) of CXL devices. They will be configured as dax devices
> > > (findable by their tags!) by the host OS, and should be provided to VMs
> > > (when appropriate) as DAX devices. And software in a VM needs to be able to
> > > find the right DAX device the same way it would running on bare metal - by
> > > the tag.
> >
> > Limiting to typical type 3 memory pool devices. Agreed. The other CXL device
> > types are a can of worms for another day.
> >
> > >
> > > Qemu can already get memory from files (-object memory-backend-file,...), and
> > > I believe this works whether it's an actual file or a devdax device. So far,
> > > so good.
> > >
> > > Qemu can back a virtual pmem device by one of these, but currently (AFAIK)
> > > not a virtual devdax device. I think virtual devdax is needed as a first-class
> > > abstraction. If we can add the tag as a property of the memory-backend-file,
> > > we're almost there - we just need a way to look up a daxdev by tag.
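> > >
> > > For reference, the part that works today looks roughly like this (device
> > > names and sizes invented for illustration):
> > >
> > >   qemu-system-x86_64 -machine q35,nvdimm=on -m 4G,slots=4,maxmem=36G \
> > >     -object memory-backend-file,id=tagmem0,share=on,mem-path=/dev/dax0.0,size=16G,align=2M \
> > >     -device nvdimm,id=nv0,memdev=tagmem0
> > >
> > > The guest sees a pmem region it can carve into devdax namespaces itself,
> > > but there is no way to hand it a devdax device directly and no tag
> > > property on the backend - which is the gap described above.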
> >
> > I'm not sure that is simple. We'd need to define a new interface capable of:
> > 1) Hotplug - potentially of many separate regions (think nested VMs).
> > That more or less rules out using separate devices on a discoverable hotpluggable
> > bus. We'd run out of bus numbers too quickly if putting them on PCI.
> > ACPI style hotplug is worse because we have to provision slots at the outset.
> > 2) Runtime provision of metadata - performance data at the very least (bandwidth /
> > latency etc.). In theory we could wire up ACPI _HMA but no one has ever bothered.
> > 3) Probably do want async error signaling. We 'could' do that with
> > FW first error injection - I'm not sure it's a good idea but it's definitely
> > an option.
> >
> > A locked down CXL device is a bit more than that, but not very much more.
> > It's easy to fake registers for things that are always in one state so
> > that the software stack is happy.
> >
> > virtio-mem has some of the parts and could perhaps be augmented
> > to support this use case with the advantage of no implicit tie to CXL.
> >
> >
> > >
> > > Summary thoughts:
> > >
> > > * A mechanism for resolving tags to "tagged capacity" devdax devices is
> > > essential (and I don't think there are specific proposals about this
> > > mechanism so far).
> >
> > Agreed.
> >
> > > * Said mechanism should not be explicitly CXL-specific.
> >
> > Somewhat agreed, but I don't want to invent a new spec just to avoid explicit
> > ties to CXL. I'm not against using CXL to present, for example, HBM or ACPI
> > Specific Purpose memory to a VM. It will trivially work if that is what a user
> > wants to do, and it also illustrates that this stuff doesn't necessarily just
> > apply to capacity on a memory pool - it might just be 'weird' memory on the host.
> >
> > > * Finding a tagged capacity devdax device in a VM should work the same as it
> > > does running on bare metal.
> >
> > Absolutely - that's a requirement.
> >
> > > * The file-backed (and devdax-backed) devdax abstraction is needed in qemu.
> >
> > Maybe. I'm not convinced the abstraction is needed at that particular level.
> >
> > > * Beyond that, I'm not yet sure what the lookup mechanism should be. Extra
> > > points for being easy to implement in both physical and virtual systems.
> >
> > For physical systems we aren't going to get agreement :( For the systems
> > I have visibility of, there will be some diversity in hardware, but
> > consistency in the presentation to userspace and above should be doable.
> >
> > Jonathan
> >
> > >
> > > Thanks for teeing this up!
> > > John
> > >
> > >
> > > [1] https://github.com/cxl-micron-reskit/famfs/blob/master/README.md
> > >
> >
> >
> >
>
Thread overview: 14+ messages
2024-08-15 16:22 [RFC] Virtualizing tagged disaggregated memory capacity (app specific, multi host shared) Jonathan Cameron via
2024-08-16 7:05 ` Hannes Reinecke
2024-08-16 9:41 ` Jonathan Cameron via
2024-08-19 2:12 ` John Groves
2024-08-19 15:40 ` Jonathan Cameron via
2024-09-17 19:37 ` Jonathan Cameron via
2024-10-22 14:11 ` Gregory Price
2024-09-17 19:56 ` Jonathan Cameron via
2024-09-18 12:12 ` Jonathan Cameron [this message]
2024-09-19 9:09 ` David Hildenbrand
2024-09-20 9:06 ` Gregory Price
2024-10-22 9:33 ` David Hildenbrand
2024-10-22 14:24 ` Gregory Price
2024-10-22 14:35 ` David Hildenbrand