From: Adam Manzanares <a.manzanares@samsung.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Gregory Price <gregory.price@memverge.com>,
Gregory Price <gourry.memverge@gmail.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"mst@redhat.com" <mst@redhat.com>,
"marcel.apfelbaum@gmail.com" <marcel.apfelbaum@gmail.com>,
"imammedo@redhat.com" <imammedo@redhat.com>,
"ani@anisinha.ca" <ani@anisinha.ca>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"bwidawsk@kernel.org" <bwidawsk@kernel.org>,
"hchkuo@avery-design.com.tw" <hchkuo@avery-design.com.tw>,
"cbrowy@avery-design.com" <cbrowy@avery-design.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"Dan Williams" <dan.j.williams@intel.com>
Subject: Re: [PATCH 0/4 v3] Multi-Region and Volatile Memory support for CXL Type-3 Devices
Date: Thu, 27 Oct 2022 18:10:23 +0000 [thread overview]
Message-ID: <20221027181014.GA317461@bgt-140510-bm01> (raw)
In-Reply-To: <20221027115854.00001f76@huawei.com>
On Thu, Oct 27, 2022 at 11:58:54AM +0100, Jonathan Cameron wrote:
> On Wed, 26 Oct 2022 16:47:18 -0400
> Gregory Price <gregory.price@memverge.com> wrote:
>
> > On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote:
> > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote:
> > > > Submitted as an extention to the multi-feature branch maintained
> > > > by Jonathan Cameron at:
> > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$
> > > >
> > > >
> > > > Summary of Changes:
> > > > 1) E820 CFMW Bug fix.
> > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers
> > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices
> > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev
>
> +CC Dan for a question on status of Generic Ports ACPI code first ECN.
> Given that was done on the mailing list I can find any public tracking
> of whether it was accepted or not - hence whether we can get on with
> implementation. There hasn't been a release ACPI spec since before
> that was proposed so we need that confirmation of the code first proposal
> being accepted to get things moving.
>
> > > >
> > > >
> > > > Regarding the E820 fix
> > > > * This bugfix is required for memory regions to work on x86
> > > > * input from Dan Williams and others suggest that E820 entry for
> > > > the CFMW should not exist, as it is expected to be dynamically
> > > > assigned at runtime. If this entry exists, it instead blocks
> > > > region creation by nature of the memory region being marked as
> > > > reserved.
> > >
> > > For CXL 2.0 it is my understanding that volatile capacity present at boot will
> > > be advertised by the firmware. In the absence of EFI I would assume this would
> > > be provided in the e820 map.
>
> Whilst this is one option, it is certainly not the case that all CXL 2.0
> platforms will decide to do any setup of CXL memory (volatile or not) in the
> firmware. They can leave it entirely to the OS, so a cold plug flow.
> Early platforms will do the setup in BIOS to support unware OSes, once
> that's not a problem any more the only reason you'd want to do this is if
> you don't have other RAM to boot the system, or for some reason you want
> your host kernel etc in the CXL attached memory.
>
> I'd expect to see BIOS having OS managed configuration as an option in the
> intermediate period where some OSes are aware, others not.
> OS knows more about usecase / policy so can make better choices on interleaving
> etc of volatile CXL type 3 memory (let alone the fun corner of devices
> where you can dynamically change the mix of volatile and non volatile
> memory).
>
>
> >
> > The issue in this case is very explicitly that a double-mapping occurs
> > for the same region. An E820 mapping for RESERVED is set *and* the
> > region driver allocates a CXL CFMW mapping. As a result the region
> > drive straight up fails to allocate regions.
> >
> > So in either case - at least from my view - the entry added as RESERVED
> > is just wrong.
> >
> > This is separate from type-3 device SRAT entries and default mappings
> > for volatile regions. For this situation, if you explicitly assign the
> > memdev backing a type-3 device to a numa node, then an SRAT area is
> > generated and an explicit e820 entry is generated and marked usable -
> > though I think there are likely issues with this kind of
> > double-referencing.
>
> SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS,
> not QEMU level of faking a few things. That BIOS should also
> be doing the full configuration (HDM Decoders and all the rest). ARM has
> a prototype for one of the fixed virtual platforms that does this (talk
> at Plumbers Uconf), we should look to do the same if we want to test
> those flows using QEMU via appropriate changes in EDK2 to walk topology
> and configure everything. Until the devices are configured there is no
> way to configure the SLIT, HMAT entries that align with the SRAT ones
> (in theory those can be updated at runtime via _SLI, _HMA but in
> practice, I'm fairly sure Linux doesn't support doing that.)
>
>
> >
> > >
> > > Is the region driver meant to cover volatile capacity present at boot? I was
> > > under the impression that it would be used for hot added volatile memory. It
> > > would be good to cover all of these assumptions for the e820 fix.
> >
> > This region appears to cover hotplug memory behind the CFMW. The
> > problem is that this e820 RESERVED mapping blocks the CFMW region from
> > being used at all.
> >
> > Without this, you can't use a type-3 persistent region, even with
> > support, let alone a volatile region. In attempting to use a persistent
> > region as volatile via ndctl and friends, I'm seeing further issues (it
> > cannot be assigned to a numa node successfully), but that's a separate
> > issue.
> For the Numa node bit...
>
> So far, the CDAT table isn't used in Linux (read out for debug purposes
> is supported only). That all needs to be added yet to get the NUMA node
> allocations to work correctly. It shouldn't have anything to do with SRAT
> unless the BIOS has done the full full configuration (same way CXL will work
> with a legacy OS). Come to think of it, that should definitely be on the
> priority list for kernel support (don't think it was on the list on Tuesday?)
>
> >
> > >
> > > Lastly it is my understanding that the region driver does not have support for
> > > volatile memory. It would be great to add that functionality if we make this
> > > change in QEMU.
> > >
> >
> > Right now this is true, but it seems a bit of a chicken/egg scenario.
> > Nothing to test against vs no support. Nudging this along such that we
> > can at least report an (unusable) hot-add volatile memory region would
> > provide someone working with the region driver something to poke and
> > prod at.
>
> Agreed. This is same place as Ben's original CXL QEMU work on non volatile.
> Need something to develop against so we'll at least have QEMU patches
> available whilst the kernel side is in development (hopefully this cycle)
> >
> > > > Regarding SRAT Generation for Type-3 Devices
> > > > * Co-Developed by Davidlohr Bueso. Built from his base patch and
> > > > extended to work with both volatile and persistent regions.
> > > > * This can be used to demonstrate static type-3 device mapping and
> > > > testing numa-access to type-3 device memory regions.
> >
> > Regarding "volatile memory present at boot" - there is still two ways
> > for that memory to be onlined: Statically (entered as an explicit e820
> > region after reading the SRAT), or Dynamically (hot-add by the region
> > driver).
> >
> > This patch would at least allow an SRAT to be generated if you
> > explicitly add a NUMA node mapping to it. Although I concede that I'm
> > not entirely sure what is "correct" here.
>
> For hotplug, needs to be the region driver, not here (or BIOS if you
> are doing a BIOS driven flow - in which case the whole topology is
> discovered and mostly configured by the BIOS before the OS reaches it
> - including filling in SRAT, SLIT, HMAT etCc).
> >
> > What this ends up looking like is mapping a memdev to both a numa node
> > and to a type-3 device. Though that seems wrong.
> >
> > After further testing it seems like creating a CPU-less, Memory-less
> > NUMA node with the intent of mapping volatile memory regions to it is
> > not supported (yet?).
>
> Yup, and I doubt it ever will be for reasons above.
>
> BIOS does it all, or OS does it all. Mixing and matching is a mess
> (there is an exception - Generic Port entries in SRAT - those we do
> need for the OS driven flow and possibly also the BIOS flow
> - patches welcome! I'd forgotten that on my list of QEMU stuff that
> needs doing.)
Based on the discussions is it safe to say we have settled on an OS driven flow
for the current QEMU CXL emulation.
>
> https://urldefense.com/v3/__https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/__;!!EwVzqGoTKBqv-0DWAJBm!XLqjPd1aXt3i5NXrZhakQlGdgJ5o4tcfV_5iUEp8vwBLv8T1Ft1OVHPI0p7KpYSFDYzhAGj_ulMz1LfZVmJgrOvrO-_v7udl$
>
> If anyone is implementing that, also good to do Generic Initiators
> as they are very similar.
>
> Jonathan
>
>
next prev parent reply other threads:[~2022-10-27 18:11 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CGME20221026004835uscas1p1d37255ba8daaba8fc7e16e5129cb94b5@uscas1p1.samsung.com>
2022-10-26 0:47 ` [PATCH 0/4 v3] Multi-Region and Volatile Memory support for CXL Type-3 Devices Gregory Price
2022-10-26 0:47 ` [PATCH 1/4] hw/i386/pc.c: CXL Fixed Memory Window should not reserve e820 in bios Gregory Price
2022-10-26 20:33 ` Michael S. Tsirkin
2022-10-26 0:47 ` [PATCH 2/4] hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition Gregory Price
2022-11-14 17:55 ` Jonathan Cameron via
2022-10-26 0:47 ` [PATCH 3/4] hw/cxl: Multi-Region CXL Type-3 Devices (Volatile and Persistent) Gregory Price
2022-11-14 17:53 ` Jonathan Cameron via
2022-11-14 23:00 ` Gregory Price
2022-11-17 13:53 ` Jonathan Cameron via
2022-11-23 17:42 ` Gregory Price
2022-10-26 0:47 ` [PATCH 4/4] hw/acpi/cxl.c: Fill in SRAT for vmem/pmem if NUMA node is assigned Gregory Price
2022-10-26 20:13 ` [PATCH 0/4 v3] Multi-Region and Volatile Memory support for CXL Type-3 Devices Adam Manzanares
2022-10-26 20:47 ` Gregory Price
2022-10-27 10:58 ` Jonathan Cameron via
2022-10-27 14:29 ` Gregory Price
2022-10-27 18:10 ` Adam Manzanares [this message]
2022-10-26 20:15 ` Michael S. Tsirkin
2022-10-26 20:20 ` Michael S. Tsirkin
2022-10-26 20:48 ` Gregory Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20221027181014.GA317461@bgt-140510-bm01 \
--to=a.manzanares@samsung.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=alison.schofield@intel.com \
--cc=ani@anisinha.ca \
--cc=bwidawsk@kernel.org \
--cc=cbrowy@avery-design.com \
--cc=dan.j.williams@intel.com \
--cc=dave@stgolabs.net \
--cc=gourry.memverge@gmail.com \
--cc=gregory.price@memverge.com \
--cc=hchkuo@avery-design.com.tw \
--cc=imammedo@redhat.com \
--cc=ira.weiny@intel.com \
--cc=linux-cxl@vger.kernel.org \
--cc=marcel.apfelbaum@gmail.com \
--cc=mst@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).