From: Gregory Price <gregory.price@memverge.com>
To: Yuquan Wang <wangyuquan1236@phytium.com.cn>
Cc: lizhijian@fujitsu.com, dan.j.williams@intel.com,
linux-cxl@vger.kernel.org, y-goto@fujitsu.com,
Jonathan.Cameron@huawei.com, dave.jiang@intel.com,
fan.ni@samsung.com
Subject: Re: CXL volatile memory: How to restore the previous region/Interleave set
Date: Fri, 31 May 2024 11:50:35 -0400
Message-ID: <Zlnxy5JnBjREhH+L@memverge.com>
In-Reply-To: <ZlhWXu6l6i2nYL+t@phytium.com.cn>

On Thu, May 30, 2024 at 06:35:10PM +0800, Yuquan Wang wrote:
> On Wed, May 29, 2024 at 12:40:41PM -0400, Gregory Price wrote:
> >
> > The CFMWS is the BIOS/EFI's mechanism to report the system configuration
> > to the Operating System, not the Operating System's mechanism to change
> > system configurations (such as interleave). What you're talking about
> > is re-configuring HDM Decoders to interleave devices *presented by* the
> > CFMWS to the operating system.
> >
> > Confusing, I know. But stick with me.
> >
> >
> >
> > The interleave referred to in the CFMWS is the BIOS/EFI telling the system
> > that memory accesses to this (physical address) region will be interleaved
> > across the set of devices that are backing that region. The operating system
> > is responsible for reading these settings and presenting the memory to the
> > system accordingly.
> >
> > The BIOS for example could configure all devices behind a single CFMW as
> > a "Single Device" that interleaves many physical devices, and the OS should
> > present it as such. In this scenario, there is no need to configure an
> > interleave region via cxl-cli - the BIOS already did that for you and
> > presented all these devices as a single device. All you need to do is
> > online the memory.
> >
>
> Sorry Gregory, here I have a question. According to your description, the
> BIOS drivers could prepare some interleaved CXL region configurations on
> default CXL hardware (SoC), just as we do with the ndctl tools at OS run
> time (cxl create-region).
>

Not in the sense of using cxl-cli or ndctl, but in the sense that
BIOS/EFI is responsible for reading hardware configurations and
presenting a sane configuration/memory map to the operating system.

It is technically possible, though not necessarily implemented anywhere,
for BIOS to read the ACPI information from the devices and program
the root complex/decoders/whathaveyou to present those devices as a
single device to the operating system.

The BIOS reads in the ACPI0016 data, generates one or more CFMWS entries
and hands off management of that CFMWS to the OS. In doing so, it's
perfectly capable of programming the CFMWS to present multiple devices
(or even specific regions in those devices) as part of a single CFMW.

This would look like reporting a single CFMWS covering multiple discrete
physical memory devices. This CFMWS would have interleave ways set
to >=2 and a TargetList with multiple discrete devices, with a single
host physical address region that applies to all of them.

The operating system would then manage this region as a single device.
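
To make the "single window, multiple targets" idea concrete, here is a
rough sketch of how an address inside such a window maps to a target.
It assumes plain modulo interleave, and the class, field names, target
names, and sizes are all made up for illustration. It is not the real
ACPI table layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FixedMemoryWindow:
    """Simplified stand-in for one CFMWS entry (not the real ACPI layout)."""
    base_hpa: int        # start of the host physical address window
    window_size: int     # bytes covered by the window
    granularity: int     # bytes sent to one target before moving to the next
    targets: List[str]   # one entry per interleave way

    @property
    def ways(self) -> int:
        return len(self.targets)

    def target_for(self, hpa: int) -> str:
        """Return the target that backs a given host physical address."""
        if not self.base_hpa <= hpa < self.base_hpa + self.window_size:
            raise ValueError("address is outside this window")
        offset = hpa - self.base_hpa
        return self.targets[(offset // self.granularity) % self.ways]


# Hypothetical 2-way window: 512 MiB at 4 GiB, 256-byte granularity.
window = FixedMemoryWindow(
    base_hpa=0x1_0000_0000,
    window_size=512 << 20,
    granularity=256,
    targets=["target0", "target1"],
)

for addr in (0x1_0000_0000, 0x1_0000_0100, 0x1_0000_0200):
    print(hex(addr), "->", window.target_for(addr))
```

Consecutive 256-byte granules alternate between the two targets, while
the OS sees one contiguous address range.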

Looking briefly at the CXL* Type 3 Memory Device Software Guide from
Intel (July 2021, Rev 1.0), this is described in section 2.6 and seems
reasonably straightforward to me.

You certainly COULD save this setup in the LSA if you wanted to, but to
put it bluntly - there's now a better way of doing/managing all of this.
HDM decoders let the OS set this all up.

And really the LSA is meant to store information about how to stitch
persistent data back together. This is probably why the LSA is not
referenced for the volatile setups in the Software Guide.

The LSA in the persistent setups is needed to ensure the data is put
back together correctly (you could pull out the devices and swap the
slots they're in, for example). This doesn't matter for volatile
devices, so the programming can be decided on the fly.
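
A toy illustration of why the labels matter for persistent data: if each
device remembers its position in the set, the set can be rebuilt in the
right order no matter which slot a device ends up in. The Label class
and its two fields are a hypothetical simplification; real LSA labels
carry more than this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Label:
    """Hypothetical, simplified label; real LSA labels carry more fields."""
    region_uuid: str
    position: int   # this device's position in the interleave set


def reassemble(slots: Dict[str, Label], region_uuid: str) -> List[str]:
    """Order slots by stored position, ignoring where devices are plugged in."""
    members = [(label.position, slot) for slot, label in slots.items()
               if label.region_uuid == region_uuid]
    return [slot for _, slot in sorted(members)]


# The same two devices before and after swapping physical slots.
before = {"slot0": Label("r-1", 0), "slot1": Label("r-1", 1)}
after = {"slot0": Label("r-1", 1), "slot1": Label("r-1", 0)}

print(reassemble(before, "r-1"))
print(reassemble(after, "r-1"))
```

With volatile memory there is no old data to line up with, so any
ordering chosen at setup time is as good as another.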

By my read - there's somewhat of an implied "We expect your hardware
environment won't change much, so a couple BIOS/EFI flags could be
set and forgotten about when setting up hardware interleave" not
written in this document.

Side note:
I believe Intel did something similar (but different!) recently where
they were presenting DRAM+CXL as a single NUMA node as a function of
BIOS programming. I don't know whether this was done via the CFMWS
or some other tomfoolery, but it's a similar concept.

(The following I'm still a little fuzzy on, but this is my best
understanding of how we got to where we are. If someone sees
inaccuracies, please slap my wrist and tell me to stfu)

HDM decoders provide the OS the capability to decide how to route host
physical addresses down to the devices with the ability to program the
root complex/host bridges, switches, and the devices to configure
hardware interleave after boot.
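
As a rough model of what that routing amounts to, the sketch below
splits a host physical address into "which way" and "which offset on
that device". It assumes simple modulo interleave with one decoder per
device and ignores XOR arithmetic, non-power-of-two ways, and any
device-side base offset. The function name and parameters are
illustrative only.

```python
def hpa_to_way_and_dpa(hpa: int, base: int, ways: int, granularity: int):
    """Split a host physical address into (way index, device-local offset)."""
    offset = hpa - base
    if offset < 0:
        raise ValueError("address is below the decoder base")
    chunk = offset // granularity   # which granule of the region this is
    way = chunk % ways              # the device that owns that granule
    # Each device stores every ways-th granule back to back.
    dpa_offset = (chunk // ways) * granularity + offset % granularity
    return way, dpa_offset


BASE = 0x1_0000_0000  # hypothetical region base

for off in (0x000, 0x100, 0x200, 0x310):
    print(hex(off), "->", hpa_to_way_and_dpa(BASE + off, BASE, 2, 256))
```

The point is only that the mapping is a function of a few programmable
values (base, ways, granularity), which is what makes it something the
OS can set up and change after boot.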

In this scenario, BIOS/EFI would report a single CFMWS to the OS for
each discrete piece of hardware, and the Operating System would then
program the HDM decoders on the host bridge(s)/switch and the devices
to implement the interleave in hardware.

This is the `cxl create-region ... ways=X devN devM ...` command.
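
For reference, a small sketch that builds (but does not run) such a
command line. The flags are the ones I understand cxl-cli to accept for
create-region (-t, -d, -w, -g, -m); double check them against
`cxl create-region --help` on your system. The decoder and memdev names
are placeholders.

```python
import shlex
from typing import List


def create_region_argv(decoder: str, memdevs: List[str],
                       granularity: int = 256) -> List[str]:
    """Build (but do not run) a cxl create-region command line."""
    if not memdevs:
        raise ValueError("need at least one memdev")
    return [
        "cxl", "create-region",
        "-t", "ram",                # volatile region
        "-d", decoder,              # root decoder to carve the region from
        "-w", str(len(memdevs)),    # interleave ways
        "-g", str(granularity),     # interleave granularity in bytes
        "-m", *memdevs,             # remaining arguments are memdev names
    ]


# Placeholder names; substitute whatever `cxl list` reports.
print(shlex.join(create_region_argv("decoder0.0", ["mem0", "mem1"])))
```

Running the resulting command needs root and real hardware (or an
emulated setup), so the sketch stops at printing it.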

In some ways you can think of the CFMWS way of interleaving as a kind
of... "Legacy Pattern", because probably just about everyone will
eventually want to use the HDM pattern because it will be capable of
supporting things like hotplug in a more maintainable manner (or at all).

For example - it's harder (if even possible) to tear down a CFMWS
implemented interleave pattern without rebooting the system than it
is to tear down an HDM implemented interleave pattern.

You might, however, want to use a combination of these two strategies.
If, for example, you have 8 expanders behind a switch attached to a
single host bridge, you might want to treat that as a single,
concrete device - as opposed to 8 separate expanders which the OS has
to manage. Doing that via the CFMWS lets BIOS/Firmware simplify the
management of the devices and forego the need for specific driver
support (at the expense of flexibility of management after boot).

In that case, you'd have the ACPI tables and firmware hardcode the
interleave and simply present the larger pool to the OS as a single
chunk of large capacity.

Or maybe you might want to have some of them interleaved, and others
managed by the host. Software defined memory is fuuuuun! :D

The specification doesn't really have an opinion on how you "should" do
all of this - it just provides at least 3 or 4 different ways to trim
the chia pet and lets you be confused by the mess it has made.

But as for the LSA and volatile regions, I still don't see a compelling
reason for needing it to store prior settings. That seems more of a
BIOS/EFI feature that needs to be programmed.

~Gregory

(P.S. I was not and am not in any way responsible or involved with
writing the spec, so I will now happily take my beatings should I
have gotten any of this horribly wrong. This is all just my best
understanding from having bashed my face against the spec the past 2
years or so).