* CXL volatile memory: How to restore the previous region/Interleave set @ 2024-05-24 7:32 Zhijian Li (Fujitsu) 2024-05-29 1:08 ` Dan Williams 0 siblings, 1 reply; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-24 7:32 UTC (permalink / raw) To: members@computeexpresslink.org, linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dan.j.williams@intel.com, dave.jiang@intel.com, Fan Ni Hey CXL and Linux-CXL communities, I am trying to understand how the current hardware and software can work together to restore the previous region/Interleave Set configuration for CXL volatile memory upon the next boot, but I don't have the answer yet. Therefore, I have several questions and hope you can provide some suggestions and thoughts. Thank you. Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 section 9.13.2, it seems that LSA applies to CXL memory (including volatile memory and persistent memory), but it does not explicitly state whether LSA is mandatory. My understanding is: - LSA is mandatory for persistent memory - LSA is optional for volatile memory Is this understanding correct? Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe the geometry of a persistent memory Interleave Set". What does "a persistent memory Interleave Set" mean here? - a persistent Interleave Set for CXL memory device (volatile and persistent) or - a persistent Interleave Set for CXL persistent memory device only. Q3, For CXL volatile memory devices without LSA installed, if users expect to restore the Interleave set to the previous configuration after reboot, the questions are: Q3.1 Where should the Interleave Set information be stored? Q3.2 Which component is responsible for restoring the Interleave Set? One scenario I understand is that in clusters using an Orchestrator, such as K8S, when a node (worker) restarts, K8S is able to read the Interleave Set from the database and set it for the corresponding node.
However, for a single-node machine, how can the kernel restore this information immediately after startup (before /init executes) without user intervention? Does the current Specification define this, or do other CXL-related firmware components, such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct the previous configuration? PS. It seems that the LSA Region Label has not been implemented yet in the Linux kernel. Region/Interleave Set state lives in running kernel memory and device registers, so it is lost after a reboot. Thanks Zhijian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-24 7:32 CXL volatile memory: How to restore the previous region/Interleave set Zhijian Li (Fujitsu) @ 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 0 siblings, 2 replies; 12+ messages in thread From: Dan Williams @ 2024-05-29 1:08 UTC (permalink / raw) To: Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dan.j.williams@intel.com, dave.jiang@intel.com, Fan Ni Hi Zhijian, I dropped members@computeexpresslink.org from this thread. If those folks are interested they can follow this discussion here: https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com Otherwise, the way the wider Linux community learns about consortium deliberations is through new published spec revisions. Zhijian Li (Fujitsu) wrote: > Hey CXL and Linux-CXL communities, > > I am trying to understand how the current hardware and software can work > together to restore the previous region/Interleave Set configuration for CXL > volatile memory upon the next boot, but I don't have the answer yet. > Therefore, I have several questions and hope you can provide some suggestions > and thoughts. Thank you. > > Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 > section 9.13.2, it seems that LSA applies to CXL memory (including volatile > memory and persistent memory), but it does not explicitly state whether LSA > is mandatory. My understanding is: > - LSA is mandatory for persistent memory > - LSA is optional for volatile memory > Is this understanding correct? I would say it differently. LSA is mandatory for persistent memory, and irrelevant for volatile memory. > Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe > the geometry of a persistent memory Interleave Set". What does "a persistent > memory Interleave Set" mean here? 
> - a persistent Interleave Set for CXL memory device (volatile and persistent) > or > - a persistent Interleave Set for CXL persistent memory device only. Persistent only, because persistence means getting the exact same content as was previously available. A volatile region configuration can be restored by recreating the region. > Q3, For CXL volatile memory devices without LSA installed, if users expect to > restore the Interleave set to the previous configuration after reboot, the > questions are: > Q3.1 Where should the Interleave Set information be stored? > Q3.2 Which component is responsible for restoring the Interleave Set? The expectation is that BIOS, or the OS for hotplug devices, deploys a default region configuration policy. That policy in the common case is likely one of either maximizing performance (maximize interleave across host-bridges), or maximizing error isolation (create an x1-interleave region per endpoint). What is currently missing on the Linux OS side is a default policy for unmapped volatile capacity after all initial device probing has completed. > One scenario I understand is that in clusters using an Orchestrator, such as > K8S, when a node (worker) restarts, K8S is able to read the Interleave Set > from the database and sets it for the corresponding node. Yes, for sophisticated environments a configuration database could store and replay region configurations each boot. > However, for a single-node machine, how can the kernel restore this information > immediately after startup (before /init executes) without user intervention? Linux needs a default policy. Likely the best place to do this is with a udev script that waits for PCI scanning to quiesce and then creates volatile regions. Unlike a PMEM region where the exact config needs to be replicated each boot, devices can be reordered and replaced over reboot without needing to rewrite labels on the device(s).
> Does the current Specification define this, or do other CXL-related firmware, > such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct > the previous configuration? There is already enough information for something like: cxl create-region --continue That would automatically create maximally sized regions until all available capacity is mapped. As it stands "cxl create-region" is not yet enlightened enough to do its own capacity search, but nothing more is needed from the specification side to enable that. > PS. > It seems that LSA Region Label has not implemented yet in linux kernel > Region/Interleave set are saved in the running kernel memory/device registers, will > get lost after reboot in linux kernel. For PMEM regions there is support for routing label updates from the nvdimm subsystem back to the label area of CXL-PMEM devices. The full end-to-end support for recovering PMEM regions from stored labels was deferred due to the lack of PMEM CXL devices in the market. So, similar to Type-2 support, it awaits an endpoint implementation to arrive. For volatile regions there is no expectation that the driver would ever need to consider labels because "recreate regions by policy" is possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
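Dan's two default policies — maximize interleave across host bridges for performance, or an x1 region per endpoint for error isolation — can be sketched as a pure planning step. Everything below (the memdev list shape, the field names) is an illustrative assumption, not the actual cxl-cli, sysfs, or kernel interface:

```python
# Sketch of the two default volatile-region policies described above.
# The memdev dicts are hypothetical; a real tool would build them from
# /sys/bus/cxl/devices or "cxl list -M" output.

def plan_volatile_regions(memdevs, policy="performance"):
    """Return a list of planned regions, each a list of memdev names."""
    if policy == "isolation":
        # error isolation: one x1-interleave region per endpoint
        return [[m["name"]] for m in memdevs]
    # performance: one region whose target order round-robins across
    # host bridges, so the interleave spans as many bridges as possible
    by_bridge = {}
    for m in memdevs:
        by_bridge.setdefault(m["host_bridge"], []).append(m["name"])
    region, queues = [], list(by_bridge.values())
    while any(queues):
        for q in queues:
            if q:
                region.append(q.pop(0))
    return [region]

devs = [
    {"name": "mem0", "host_bridge": "hb0"},
    {"name": "mem1", "host_bridge": "hb0"},
    {"name": "mem2", "host_bridge": "hb1"},
]
print(plan_volatile_regions(devs, "isolation"))    # [['mem0'], ['mem1'], ['mem2']]
print(plan_volatile_regions(devs, "performance"))  # [['mem0', 'mem2', 'mem1']]
```

A udev-triggered script, as suggested above, could feed the planned target lists to `cxl create-region` once PCI scanning has quiesced.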
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 1:08 ` Dan Williams @ 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 15:44 ` Gregory Price 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-29 10:19 UTC (permalink / raw) To: Dan Williams, linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni Thanks Dan, On 29/05/2024 09:08, Dan Williams wrote: > Hi Zhijian, > > I dropped members@computeexpresslink.org from this thread. If those > folks are interested they can follow this discussion here: > > https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com > > Otherwise, the way the wider Linux community learns about consortium > deliberations is through new published spec revisions. Agreed. thank you. > > Zhijian Li (Fujitsu) wrote: >> Hey CXL and Linux-CXL communities, >> >> I am trying to understand how the current hardware and software can work >> together to restore the previous region/Interleave Set configuration for CXL >> volatile memory upon the next boot, but I don't have the answer yet. >> Therefore, I have several questions and hope you can provide some suggestions >> and thoughts. Thank you. >> >> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 >> section 9.13.2, it seems that LSA applies to CXL memory (including volatile >> memory and persistent memory), but it does not explicitly state whether LSA >> is mandatory. My understanding is: >> - LSA is mandatory for persistent memory >> - LSA is optional for volatile memory >> Is this understanding correct? > > I would say it differently. LSA is mandatory for persistent memory, and > irrelevant for volatile memory. 
Another reason for my above understanding is that the current QEMU documentation[1] also states this, and the actual code behaves accordingly (mandatory for persistent memory and optional for volatile memory). [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 Does anyone have a different opinion on this? I hope the CXL consortium can help confirm this point to members@computeexpresslink.org privately. (Currently, the original mail cannot reach members@computeexpresslink.org until it gets approval.) > >> Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe >> the geometry of a persistent memory Interleave Set". What does "a persistent >> memory Interleave Set" mean here? >> - a persistent Interleave Set for CXL memory device (volatile and persistent) >> or >> - a persistent Interleave Set for CXL persistent memory device only. > > Peristent only, because persistence means getting the exact same content > as was previously available. A volatile region configuration can be > restored by recreating the region.> >> Q3, For CXL volatile memory devices without LSA installed, if users expect to >> restore the Interleave set to the previous configuration after reboot, the >> questions are: >> Q3.1 Where should the Interleave Set information be stored? >> Q3.2 Which component is responsible for restoring the Interleave Set? > > The expectation is that BIOS, or the OS for hotplug devices, deploys a > default region configuration policy. That policy in the common is likely > one of either maximizing performance (maximize interleave across > host-bridges), or maximizing error isolation (create an x1-interleave > region per endpoint). >> What is currently missing on the Linux OS side is a default policy for > unmapped volatile capacity after all initial device probing has > completed. I would say I can imagine the policy you mentioned.
*policy* would work in some cases, but for a customized region, a default/predefined *policy* is not sufficient. In my mind, a customized region could be constructed from some or all of the available memdevs (SLD/MLD/DCD) with a customized interleave set. So the previously used configuration, including the memdevs and the interleave set, should be stored in persistent storage so that the region can be restored correctly. If these configurations cannot be read from the device(s), the most likely fallback is for the host OS to try to reconstruct the region according to the settings in its config file (/etc/cxl/mem.config) > >> One scenario I understand is that in clusters using an Orchestrator, such as >> K8S, when a node (worker) restarts, K8S is able to read the Interleave Set >> from the database and sets it for the corresponding node. > > Yes, for sophisticated environments a configuration database could store > and replay region configurations each boot. > >> However, for a single-node machine, how can the kernel restore this information >> immediately after startup (before /init executes) without user intervention? > > Linux needs a default policy. Likely the best place to do this is with a > udev script that waits for PCI scanning to quiesce and then creates > volatile regions. > > Unlike a PMEM region where the exact config needs to be replicated each > boot, devices can be reordered and replaced over reboot without needing > to rewrite labels on the device(s). > >> Does the current Specification define this, or do other CXL-related firmware, >> such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct >> the previous configuration? > > There is already enough information for something like: > > cxl create-region --continue > > That would automatically create maximally sized regions until all > available capacity is mapped.
As it stands "cxl create-region" is not > yet enlightened enough to do its own capacity search, but nothing more > is needed from the specification side to enable that. > >> PS. >> It seems that LSA Region Label has not implemented yet in linux kernel >> Region/Interleave set are saved in the running kernel memory/device registers, will >> get lost after reboot in linux kernel. > > For PMEM regions there is support for routing label updates from the > nvdimm subsystem back to the label area of CXL-PMEM devices. The full > end-to-end support for recovering PMEM regions from stored labels was > deferred due to the lack of PMEM CXL devices in the market. So, similar > to Type-2 support, it awaits an endpoint implementation to arrive. > Okay, understood. Thanks Zhijian > For volatile regions there is no expectation that the driver would ever > need to consider labels because "recreate regions by policy" is > possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
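The `--continue` behavior Dan describes is essentially a greedy capacity search. A rough model of that planning step, under assumed simplifications (plain byte counts instead of decoder-granularity-aligned sizes, power-of-2 ways only, no window restrictions — none of this reflects the actual cxl-cli implementation):

```python
# Hypothetical model of a "cxl create-region --continue" capacity
# search: repeatedly build the widest region the remaining unmapped
# capacity allows, until nothing is left.

def plan_continue(capacity, valid_ways=(1, 2, 4, 8)):
    """capacity: {memdev: unmapped_bytes}. Returns (targets, size) pairs."""
    remaining = dict(capacity)
    regions = []
    while True:
        avail = {n: b for n, b in remaining.items() if b > 0}
        if not avail:
            break
        # widest interleave the populated memdevs allow
        ways = max(w for w in valid_ways if w <= len(avail))
        # take the largest contributors; the region size is bounded by
        # its smallest member, so this keeps each region maximal
        targets = sorted(avail, key=avail.get, reverse=True)[:ways]
        per_dev = min(avail[t] for t in targets)
        regions.append((sorted(targets), ways * per_dev))
        for t in targets:
            remaining[t] -= per_dev
    return regions

print(plan_continue({"mem0": 256, "mem1": 256, "mem2": 128}))
# [(['mem0', 'mem1'], 512), (['mem2'], 128)]
```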
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 10:19 ` Zhijian Li (Fujitsu) @ 2024-05-29 15:44 ` Gregory Price 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 0 siblings, 1 reply; 12+ messages in thread From: Gregory Price @ 2024-05-29 15:44 UTC (permalink / raw) To: Zhijian Li (Fujitsu) Cc: Dan Williams, linux-cxl@vger.kernel.org, Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni On Wed, May 29, 2024 at 10:19:21AM +0000, Zhijian Li (Fujitsu) wrote: > Thanks Dan, > > > On 29/05/2024 09:08, Dan Williams wrote: > > Hi Zhijian, > > > > I dropped members@computeexpresslink.org from this thread. If those > > folks are interested they can follow this discussion here: > > > > https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com > > > > Otherwise, the way the wider Linux community learns about consortium > > deliberations is through new published spec revisions. > > Agreed. thank you. > > > > > > Zhijian Li (Fujitsu) wrote: > >> Hey CXL and Linux-CXL communities, > >> > >> I am trying to understand how the current hardware and software can work > >> together to restore the previous region/Interleave Set configuration for CXL > >> volatile memory upon the next boot, but I don't have the answer yet. > >> Therefore, I have several questions and hope you can provide some suggestions > >> and thoughts. Thank you. > >> > >> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 > >> section 9.13.2, it seems that LSA applies to CXL memory (including volatile > >> memory and persistent memory), but it does not explicitly state whether LSA > >> is mandatory. My understanding is: > >> - LSA is mandatory for persistent memory > >> - LSA is optional for volatile memory > >> Is this understanding correct? > > > > I would say it differently. LSA is mandatory for persistent memory, and > > irrelevant for volatile memory. 
> > Another reason for my above understanding is that the current QEMU > documentation[1] also states this, and the actual code behaves accordingly > (mandatory for persistent memory and optional for volatile memory). > > [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 > > Anyone have other opinion on this? > > Hope the CXL consortium can help on confirming on this point to > members@computeexpresslink.org privately. > (Currently the original mail cannot reach members@computeexpresslink.org > until getting its approval) > Just to be concrete: CXL Spec 3.1: 8.2.9.9.2.3 "The Label Storage Area (LSA) *shall be* supported by a memory device that provides persistent memory capacity and *may be* supported by a device that provides only volatile memory capacity" For persistent: Required For volatile: Optional What Dan is saying is that a volatile-only device with an LSA is possible, but the LSA isn't particularly novel or useful since the data in the volatile region is destroyed when the device is reset. Recording things like interleave set configurations in LSA doesn't really make sense, given that devices probably shouldn't know about each other, and an interleave set kind of implies knowledge of other devices. So as Dan said: An LSA is irrelevant for a volatile device. Anything you'd use it for is probably better accomplished some other way.
*policy* would work > on some cases, but > For a customized region, a default/predefined *policy* is not sufficient. > In my mind, a customized region could be constructed with part of all > the available memdevs(SLD/MLD/DCD), and customized interleave set. > > So the the previous used configurations including memdevs and interleave set > should be stored in a persistent storage that the region can be restored > correctly. > There's any number of reasons this is a bad idea, the most obvious of which is that you're recording information about a set of devices on each device. What if some of those devices go away (or are upgraded, change, whatever) and you need to do something different? What if you move that device to another machine? The state machine you create with this setup is pretty awful. I suspect in answering these questions, you end up just resolving to save the configuration to disk instead of the LSAs - which (as Dan said) makes the LSA irrelevant. Really what you want is a smarter daemon that detects the topology and implements the best policy given the current environment. ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
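Gregory's objection — devices disappearing, being upgraded, or moving between machines — is really a validation problem for any save-and-replay tool. A minimal sketch of the check such a tool would need before reapplying a saved configuration (the serial-number-keyed format and field names are invented for illustration):

```python
# Validate a saved region description against the memdevs actually
# present before replaying it. Both data shapes are hypothetical.

def check_saved_region(saved, present_memdevs):
    """Return a list of mismatches; an empty list means the saved
    region can be replayed as-is."""
    present = {d["serial"]: d for d in present_memdevs}
    problems = []
    for t in saved["targets"]:
        dev = present.get(t["serial"])
        if dev is None:
            problems.append(f"missing device: serial {t['serial']}")
        elif dev["volatile_bytes"] < t["bytes"]:
            problems.append(f"device {t['serial']} has less capacity than saved")
    return problems

saved = {"targets": [{"serial": "A", "bytes": 100}, {"serial": "B", "bytes": 100}]}
print(check_saved_region(saved, [{"serial": "A", "volatile_bytes": 100}]))
# ['missing device: serial B']
```

On any mismatch the tool would fall back to a default policy (or ask the administrator) rather than replay blindly — essentially the "smarter daemon" behavior Gregory describes.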
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 15:44 ` Gregory Price @ 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 0 siblings, 0 replies; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-30 9:56 UTC (permalink / raw) To: Gregory Price Cc: Dan Williams, linux-cxl@vger.kernel.org, Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni On 29/05/2024 23:44, Gregory Price wrote: > On Wed, May 29, 2024 at 10:19:21AM +0000, Zhijian Li (Fujitsu) wrote: >> Thanks Dan, >> >> >> On 29/05/2024 09:08, Dan Williams wrote: >>> Hi Zhijian, >>> >>> I dropped members@computeexpresslink.org from this thread. If those >>> folks are interested they can follow this discussion here: >>> >>> https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com >>> >>> Otherwise, the way the wider Linux community learns about consortium >>> deliberations is through new published spec revisions. >> >> Agreed. thank you. >> >> >>> >>> Zhijian Li (Fujitsu) wrote: >>>> Hey CXL and Linux-CXL communities, >>>> >>>> I am trying to understand how the current hardware and software can work >>>> together to restore the previous region/Interleave Set configuration for CXL >>>> volatile memory upon the next boot, but I don't have the answer yet. >>>> Therefore, I have several questions and hope you can provide some suggestions >>>> and thoughts. Thank you. >>>> >>>> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 >>>> section 9.13.2, it seems that LSA applies to CXL memory (including volatile >>>> memory and persistent memory), but it does not explicitly state whether LSA >>>> is mandatory. My understanding is: >>>> - LSA is mandatory for persistent memory >>>> - LSA is optional for volatile memory >>>> Is this understanding correct? >>> >>> I would say it differently. LSA is mandatory for persistent memory, and >>> irrelevant for volatile memory. 
>> >> >> Another reason for my above understanding is that the current QEMU >> documentation[1] also states this, and the actual code behaves accordingly >> (mandatory for persistent memory and optional for volatile memory). >> >> [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 >> >> Anyone have other opinion on this? >> >> Hope the CXL consortium can help on confirming on this point to >> members@computeexpresslink.org privately. >> (Currently the original mail cannot reach members@computeexpresslink.org >> until getting its approval) >> > > Just to be concrete: > > CXL Spec 3.1: 8.2.9.9.2.3 > > "The Label Storage Area (LSA) *shall be* supported by a memory device > that provides persistent memory capacity and *may be* supported by a > device that provides only volatile memory capacity" Thanks for the reference, I missed it before. > > For persistent: Required > For volatile: Optional > > What Dan is saying is that an voltile-only device with an LSA is > possible, but the LSA isn't particularly novel or useful since the > data in the voltile region is destroyed when the device is reset. > Well, I misunderstood this in the previous reply. Many thanks for your explanation on this point. > Recording things like interleave set configurations in LSA doesn't > really make sense, given that devices probably shouldn't know about each > other, and an interleave set kind of implies knowledge of other devices. > > So as Dan said: An LSA is irrelevant for a voltile device. Anything > you'd use it for is probably better accomplished some other way.> >>> >>> The expectation is that BIOS, or the OS for hotplug devices, deploys a >>> default region configuration policy. That policy in the common is likely >>> one of either maximizing performance (maximize interleave across >>> host-bridges), or maximizing error isolation (create an x1-interleave >>> region per endpoint). 
>>>> What is currently missing on the Linux OS side is a default policy for >>> unmapped volatile capacity after all initial device probing has >>> completed. >> >> I would say I can imagine the policy you mentioned. *policy* would work >> on some cases, but >> For a customized region, a default/predefined *policy* is not sufficient. >> In my mind, a customized region could be constructed with part of all >> the available memdevs(SLD/MLD/DCD), and customized interleave set. >> >> So the the previous used configurations including memdevs and interleave set >> should be stored in a persistent storage that the region can be restored >> correctly. >> > > There's any number of reasons this is a bad idea, the most obvious of > which is that your recording information about a set of devices on each > device. What if some of those devices go away (or are upgraded, change, > whatever) and you need to do something different? What if you move that > device to another machine? The state machine you create with this setup > is pretty awful. I suspect in answering these questions, you end up > just resolving to save the configuration to disk instead of the LSAs - > which (as Dan said) makes the LSA irrelevant. > In my understanding, the LSA Region Label should include the information needed to restore the previous interleave set (across memdevs). Of course, this information in the LSA differs from what is saved on disk. > Really what you want is a smarter daemon that detects the topology and > implements the best policy given the current environment. Anyway, I think I have gotten the answer from your and Dan's replies. Thanks again, Zhijian > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) @ 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price 2024-05-31 20:56 ` Dan Williams 1 sibling, 2 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-05-29 11:33 UTC (permalink / raw) To: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni Hi Dan-san, > > Q3, For CXL volatile memory devices without LSA installed, if users > > expect to restore the Interleave set to the previous configuration > > after reboot, the questions are: > > Q3.1 Where should the Interleave Set information be stored? > > Q3.2 Which component is responsible for restoring the Interleave Set? > > The expectation is that BIOS, or the OS for hotplug devices, deploys a default > region configuration policy. That policy in the common is likely one of either > maximizing performance (maximize interleave across host-bridges), or > maximizing error isolation (create an x1-interleave region per endpoint). To be honest, I feel the CFMWS is a somewhat incomplete part of the spec. When I first saw the "CXL* Type 3 Memory Device Software Guide" and noticed the existing CFMWS, I thought that the firmware would create it based on some configuration, and the OS would read it and create a region for each window it describes. Even if a user executed the cxl create-region command and configured an interleaved region, I thought the OS would tell the firmware (or something) about it, and the CFMWS would reflect it on the next boot. But the reality is that the above scenario applies only to persistent memory with an LSA. Even when a user configures a new region for volatile memory, I could not find any specification for telling the new configuration to the firmware. Could you tell me why such an interface is not defined in the CXL specification?
Is it just because there is no place to store region information for volatile memory? IMHO, users want to keep the previous configuration after a reboot even for volatile memory. Though users aren't concerned about the contents of volatile memory, they want to keep the region/interleave configuration across reboots. Especially if the previous configuration was made years ago, I'll bet users will have forgotten how they configured their CXL volatile memory regions. > > What is currently missing on the Linux OS side is a default policy for unmapped > volatile capacity after all initial device probing has completed. > > > One scenario I understand is that in clusters using an Orchestrator, > > such as K8S, when a node (worker) restarts, K8S is able to read the > > Interleave Set from the database and sets it for the corresponding node. > > Yes, for sophisticated environments a configuration database could store and > replay region configurations each boot. Just an idea, but I suppose the cxl command should have two features: - save the current region configuration to a file specified by its operand. - reconfigure regions based on the specified file. (If the hardware configuration has changed, the command returns an error and displays what changed.) That would probably be enough for most users... But what do you think? Thanks, > > > However, for a single-node machine, how can the kernel restore this > > information immediately after startup (before /init executes) without user > intervention? > > Linux needs a default policy. Likely the best place to do this is with a udev > script that waits for PCI scanning to quiesce and then creates volatile regions. > > Unlike a PMEM region where the exact config needs to be replicated each boot, > devices can be reordered and replaced over reboot without needing to rewrite > labels on the device(s).
> > > Does the current Specification define this, or do other CXL-related > > firmware, such as UEFI/ACPI/CFMWS/Fabric Manager, have enough > > information to reconstruct the previous configuration? > > There is already enough information for something like: > > cxl create-region --continue > > That would automatically create maximally sized regions until all available > capacity is mapped. As it stands "cxl create-region" is not yet enlightened > enough to do its own capacity search, but nothing more is needed from the > specification side to enable that. > > > PS. > > It seems that LSA Region Label has not implemented yet in linux kernel > > Region/Interleave set are saved in the running kernel memory/device > > registers, will get lost after reboot in linux kernel. > > For PMEM regions there is support for routing label updates from the nvdimm > subsystem back to the label area of CXL-PMEM devices. The full end-to-end > support for recovering PMEM regions from stored labels was deferred due to > the lack of PMEM CXL devices in the market. So, similar to Type-2 support, it > awaits an endpoint implementation to arrive. > > For volatile regions there is no expectation that the driver would ever need to > consider labels because "recreate regions by policy" is possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) @ 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 2024-05-31 20:56 ` Dan Williams 1 sibling, 2 replies; 12+ messages in thread From: Gregory Price @ 2024-05-29 16:40 UTC (permalink / raw) To: Yasunori Gotou (Fujitsu) Cc: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org, Jonathan Cameron, dave.jiang@intel.com, Fan Ni On Wed, May 29, 2024 at 11:33:46AM +0000, Yasunori Gotou (Fujitsu) wrote: > Hi Dan-san, > > > > Q3, For CXL volatile memory devices without LSA installed, if users > > > expect to restore the Interleave set to the previous configuration > > > after reboot, the questions are: > > > Q3.1 Where should the Interleave Set information be stored? > > > Q3.2 Which component is responsible for restoring the Interleave Set? > > > > The expectation is that BIOS, or the OS for hotplug devices, deploys a default > > region configuration policy. That policy in the common is likely one of either > > maximizing performance (maximize interleave across host-bridges), or > > maximizing error isolation (create an x1-interleave region per endpoint). > > To be honest, I feel CFMWS seems to be something incomplete spec.. > > When I first saw the " CXL* Type 3 Memory Device Software Guide", and noticed existing > CFMWS, I thought that the firmware would create it based on some configuration, > and OS would read it and create region for each window information. > Even if user would execute cxl create-region command and configure interleaved region, > I thought OS would tell it to firmware (or something), and CFMWS would reflect it on the next boot. Ok this has just made me realize that I really do need to write that article on the various forms of interleaving in a post-CXL world. 
Quoting some of the specification: CXL 3.1 Section 9.18.1.3: CXL Fixed Memory Window Structure """ The CFMWS structure describes zero or more Host Physical Address (HPA) windows that are associated with each CXL Host Bridge. Each window represents a contiguous HPA range that may be interleaved across one or more targets, some of which are CXL Host Bridges. Associated with each window are a set of restrictions that govern its usage. It is the OSPM's responsibility to utilize each window for the specified use. The HPA ranges described by CFMWS may include addresses that are currently assigned to CXL.mem devices. Before assigning HPAs from a fixed-memory window, the OSPM must check the current assignments and avoid any conflicts. For any given HPA, it shall not be described by more than one CFMWS entry """ Dan, please correct me if I'm wrong, but I'm fairly certain the following is accurate. The CFMWS is the BIOS/EFI's mechanism to report the system configuration to the Operating System, not the Operating System's mechanism to change system configurations (such as interleave). What you're talking about is re-configuring HDM Decoders to interleave devices *presented by* the CFMWS to the operating system. Confusing, I know. But stick with me. The interleave referred to in the CFMWS is the BIOS/EFI telling the system that memory accesses to this (physical address) region will be interleaved across the set of devices that are backing that region. The operating system is responsible for reading these settings and presenting the memory to the system accordingly. The BIOS for example could configure all devices behind a single CFMW as a "Single Device" that interleaves many physical devices, and the OS should present it as such. In this scenario, there is no need to configure an interleave region via cxl-cli - the BIOS already did that for you and presented all these devices as a single device. All you need to do is online the memory. 
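The CFMWS rule quoted above — no HPA shall be described by more than one entry — amounts to an overlap check over the windows' HPA ranges. A toy model (entry layout and values are invented, not the real CEDT binary format):

```python
# Toy model of the quoted CFMWS rule: for any given HPA, it shall not be
# described by more than one CFMWS entry. Entries are (base, size)
# tuples in bytes; the addresses are invented for illustration.

def cfmws_overlaps(entries):
    """Return True if any two HPA windows overlap."""
    spans = sorted((base, base + size) for base, size in entries)
    # After sorting by base, an overlap exists iff some window starts
    # before the previous one ends.
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))

# Two adjacent 16GB windows: legal.
assert not cfmws_overlaps([(0x1000000000, 0x400000000),
                           (0x1400000000, 0x400000000)])
# Second window starts inside the first: violates the rule.
assert cfmws_overlaps([(0x1000000000, 0x400000000),
                       (0x1200000000, 0x400000000)])
```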
Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI options that say how to configure a set of CXL devices behind one or more host bridges prior to OS boot. This has its limitations. For example, you'd need to reboot the system to make changes and hotplugging a memory device becomes impossible. The BIOS/EFI would also need to understand when the prior configuration is no longer valid - complicated and problematic. Additionally, for more dynamic environments (devices behind a switch, or a DCD) this more "static" configuration may (read: does) reduce your management flexibility. I.e. hotplug may not be possible. Alternatively, the BIOS may configure each device separately, and the OS may create a region that interleaves those devices explicitly by programming an HDM decoder. In this scenario, the OS could tear down the region, hotplug that device, and recreate the region with new settings accordingly. Greater management flexibility, but more software/management complexity. This requires the OS to recreate the region/interleave set on each reboot - and is probably the preferred mechanism for configuring the system (if only because hotplug and device failure are not uncommon). In this scenario, re-configuration looks a lot like storage mounting. The device is either there or it isn't, and the configuration file either works or it doesn't. Alternatively the daemon setting this all up is free to try to make auto-configuration decisions. (Final note about interleave for completeness' sake, but not really relevant to this discussion) Alternatively you could just online each device as a separate region, and simply use something like set_mempolicy/numactl to implement interleave on a per-task basis. > > But, really is that the above scenario is only for persistent memory with LSA. > Even if a user configures a new region for volatile memory, and I could not find any specification to > tell the new configuration to the Firmware. 
> > Could you tell me why such interface is not defined in the CXL specification? > Is it just because there is no place to store region information for volatile memory? > > > IMHO, users want to keep previous configuration after reboot even if it is volatile memory. > Though users don't concern about contents of volatile memory, they want to keep region/interleave > configuration after reboot. Especially, if previous configuration is some years ago, I'll bet > users will forget how they configured regions against cxl volatile memory. > Probably we want some daemon that reconfigures this similar to how we're doing it with storage. You register a preferred configuration given the hardware environment that is valid until the hardware changes. The OS shouldn't really be telling the firmware to configure itself if only because what happens if you unplug a device? ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
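Gregory's "storage mounting" analogy can be sketched as a daemon that replays a declarative region config and reports missing devices the way a failed mount would. The config shape and device names below are invented for illustration; a real daemon would enumerate devices from sysfs and shell out to cxl-cli:

```python
# Sketch of a region-replay daemon: read a declarative config, emit the
# create-region commands to replay, and flag entries whose devices are
# absent - analogous to an fstab entry whose disk is missing.

def plan_regions(wanted, present):
    """wanted: list of {"region": name, "devices": [...]}.
    present: set of device names currently enumerated.
    Returns (commands_to_run, missing_entries)."""
    commands, missing = [], []
    for entry in wanted:
        absent = [d for d in entry["devices"] if d not in present]
        if absent:
            # Like a failed mount: report it rather than guess.
            missing.append((entry["region"], absent))
        else:
            commands.append("cxl create-region -m " + " ".join(entry["devices"]))
    return commands, missing
```

This also matches Gregory's later point: the configuration is "valid until the hardware changes", and the daemon's job is to notice when it no longer is.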
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 16:40 ` Gregory Price @ 2024-05-30 10:35 ` Yuquan Wang 2024-05-31 15:50 ` Gregory Price 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Yuquan Wang @ 2024-05-30 10:35 UTC (permalink / raw) To: Gregory Price Cc: lizhijian, dan.j.williams, linux-cxl, y-goto, Jonathan.Cameron, dave.jiang, fan.ni On Wed, May 29, 2024 at 12:40:41PM -0400, Gregory Price wrote: > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration > to the Operating System, not the Operating System's mechanism to change > system configurations (such as interleave). What you're talking about > is re-configuring HDM Decoders to interleave devices *presented by* the > CFMWS to the operating system. > > Confusing, I know. But stick with me. > > > > The interleave referred to the CFMWS is the BIOS/EFI telling the system > that memory accesses to this (physicall address) region will be interleaved > across the set of devices that are backing that region. The operating system > is responsible for reading these settings and presenting the memory to the > system accordingly. > > The BIOS for example could configure all devices behind a single CFMW as > a "Single Device" that interleaves many physical devices, and the OS should > present it as such. In this scenario, there is no need to configure an > interleave region via cxl-cli - the BIOS already did that for you and > presented all these devices as a single device. All you need to do is > online the memory. > Sorry Gregory, here I have a question. According to your description, the BIOS could prepare some interleaved CXL region configurations on the default CXL hardware (SoC), just like we do with the ndctl/cxl tools at OS run time (cxl create-region). 
> Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI > options that say how to configure a set of CXL devices behind one or more > host bridges prior to OS boot. This has its limitations. For example, you'd > need to reboot the system to make changes and hotplugging a memory device > becomes impossible. The BIOS/EFI would also need to understand when the > prior configuration is no longer valid - complicated and problematic. > > Additionally, for more dynamic environments (devices behind a switch, > or a DCD) this more "static" configuration may (read: does) reduce your > management flexibility. I.e. hotplug may not be possible. > > > > Alternatively, the BIOS may configure each device separately, and the > OS is may create a region that interleaves those devices explicitly by > programming an HDM decoder. > > In this scenario, the OS could tear down the region, hotplug that device, > and recreate the region with new settings accordingly. Greater > management flexibility, but more software/management complexity. > > This requires the OS to recreate the region/interleave set on each > reboot - and is probably the preferred mechanism for configuring the > system (if only because hotplug and device failure is not uncommon). > > In this scenario, re-configuration looks a lot like storage mounting. > The device is either there or it isn't, and the configuration file > either works or it doesn't. Alternatively the daemon setting this all > up is free to try to make auto-configuration decisions. > > > > > (Final note about interleave for completion sake, but not really > relevant to this discussion) > > Alternatively you could just online each device as a separate region, > and simply use something like set_mempolicy/numactl to implement > interleave on a per-task basis. > > > > > > But, really is that the above scenario is only for persistent memory with LSA. 
> > Even if a user configures a new region for volatile memory, and I could not find any specification to > > tell the new configuration to the Firmware. > > > > Could you tell me why such interface is not defined in the CXL specification? > > Is it just because there is no place to store region information for volatile memory? > > > > > > IMHO, users want to keep previous configuration after reboot even if it is volatile memory. > > Though users don't concern about contents of volatile memory, they want to keep region/interleave > > configuration after reboot. Especially, if previous configuration is some years ago, I'll bet > > users will forget how they configured regions against cxl volatile memory. > > > > Probably we want some daemon that reconfigures this similar to how we're > doing it with storage. You register a preferred configuration given the > hardware environment that is valid until the hardware changes. > > The OS shouldn't really be telling the firmware to configure itself if > only because what happens if you unplug a device? > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-30 10:35 ` Yuquan Wang @ 2024-05-31 15:50 ` Gregory Price 0 siblings, 0 replies; 12+ messages in thread From: Gregory Price @ 2024-05-31 15:50 UTC (permalink / raw) To: Yuquan Wang Cc: lizhijian, dan.j.williams, linux-cxl, y-goto, Jonathan.Cameron, dave.jiang, fan.ni On Thu, May 30, 2024 at 06:35:10PM +0800, Yuquan Wang wrote: > On Wed, May 29, 2024 at 12:40:41PM -0400, Gregory Price wrote: > > > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration > > to the Operating System, not the Operating System's mechanism to change > > system configurations (such as interleave). What you're talking about > > is re-configuring HDM Decoders to interleave devices *presented by* the > > CFMWS to the operating system. > > > > Confusing, I know. But stick with me. > > > > > > > > The interleave referred to the CFMWS is the BIOS/EFI telling the system > > that memory accesses to this (physicall address) region will be interleaved > > across the set of devices that are backing that region. The operating system > > is responsible for reading these settings and presenting the memory to the > > system accordingly. > > > > The BIOS for example could configure all devices behind a single CFMW as > > a "Single Device" that interleaves many physical devices, and the OS should > > present it as such. In this scenario, there is no need to configure an > > interleave region via cxl-cli - the BIOS already did that for you and > > presented all these devices as a single device. All you need to do is > > online the memory. > > > > Sorry Gregory, here I have a question. According to your description, the > bios drivers could prepare some interleave cxl region configurations on > default cxl hardware(SoC) just like we using ndctl-tools in OS run-time > (cxl create-region). 
> Not in the sense of using cxl-cli or ndctl, but in the sense that BIOS/EFI is responsible for reading hardware configurations and presenting a sane configuration/memory map to the operating system. It is technically possible, though not necessarily implemented anywhere, for BIOS to read the ACPI information from the devices and program the root complex/decoders/whathaveyou to present those devices as a single device to the operating system. The BIOS reads in the ACPI0016 data, generates one or more CFMWS entries and hands off management of that CFMWS to the OS. In doing so, it's perfectly capable of programming the CFMWS to present multiple devices (or even specific regions in those devices) as part of a single CFMW. This would look like reporting a single CFMWS covering multiple discrete physical memory devices. This CFMWS would have interleave ways set to >=2 and a TargetList with multiple discrete devices, with a single hardware physical address region that applies to both. The operating system would then manage this region as a single device. Looking briefly at the CXL* Type 3 Memory Device Software Guide from Intel (July 2021, Rev 1.0), this is described in section 2.6 and seems reasonably straightforward to me. You certainly COULD save this setup in the LSA if you wanted to, but to put it bluntly - there's now a better way of doing/managing all of this. HDM decoders let the OS set this all up. And really the LSA is meant to store information about how to stitch persistent data back together. This is probably why the LSA is not referenced for the volatile setups in the Software Guide. The LSA in the persistent setups is needed to ensure the data is put back together correctly (you could pull out the devices and swap the slots they're in, for example). This doesn't matter for volatile devices, so the programming can be decided on the fly. 
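The "interleave ways >= 2 plus a TargetList" arrangement described above can be illustrated with the basic target-selection math: the HPA bits just above the interleave granularity pick the target. This is a simplified sketch assuming power-of-two ways and no XOR-based decode (real decoders also support XOR math and 3/6/12-way modes); target names are invented:

```python
# Simplified illustration of how an interleaved window routes an HPA to
# one of its targets: which granule the offset falls in, modulo the
# number of ways, indexes the TargetList.

def select_target(hpa_offset, granularity, ways, targets):
    """hpa_offset: byte offset into the window; granularity in bytes."""
    index = (hpa_offset // granularity) % ways
    return targets[index]

targets = ["hb0", "hb1"]                      # hypothetical host bridges
assert select_target(0x000, 256, 2, targets) == "hb0"
assert select_target(0x100, 256, 2, targets) == "hb1"
assert select_target(0x200, 256, 2, targets) == "hb0"
```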
By my read - there's somewhat of an implied "We expect your hardware environment won't change much, so a couple BIOS/EFI flags could be set and forgotten about when setting up hardware interleave" not written in this document. Side note: I believe Intel did something similar (but different!) recently where they were presenting DRAM+CXL as a single NUMA node as a function of BIOS programming. I don't know whether this was done via the CFMWS or some other tomfoolery, but it's a similar concept. (The following I'm still a little fuzzy on, but this is my best understanding of how we got to where we are. If someone sees inaccuracies, please slap my wrist and tell me to stfu) HDM decoders provide the OS the capability to decide how to route host physical addresses down to the devices with the ability to program the root complex/host bridges, switches, and the devices to configure hardware interleave after boot. In this scenario, BIOS/EFI would report a single CFMWS to the OS for each discrete piece of hardware, and the Operating System would then program the HDM decoders on the host bridge(s)/switch and the devices to implement the interleave in hardware. This is the `cxl create-region ... ways=X devN devM ...` command. In some ways you can think of the CFMWS way of interleave as a kind of... "Legacy Pattern", because probably just about everyone will eventually want to use the HDM pattern because it will be capable of supporting things like hotplug in a more maintainable manner (or at all). For example - it's harder (if even possible) to tear down a CFMWS-implemented interleave pattern without rebooting the system than it is to tear down an HDM-implemented interleave pattern. You might, however, want to use a combination of these two strategies. If, for example, you have 8 expanders behind a switch attached to a single host bridge, you might want to treat that as a single, concrete device - as opposed to 8 separate expanders which the OS has to manage. 
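The HDM-decoder side of the math above is the reverse mapping: each target device sees a device physical address (DPA) with the interleave bits removed, since it holds every ways-th granule. Again a simplified sketch (power-of-two ways, no XOR math), not the literal decoder register programming:

```python
# Companion sketch: translating a window offset to the DPA the selected
# target sees. Each device stores one granule out of every `ways`
# granules, so the interleave bits collapse out of the address.

def hpa_offset_to_dpa(hpa_offset, granularity, ways):
    granule = hpa_offset // granularity          # which granule overall
    # The device holds granule // ways on its side, plus the offset
    # within the granule.
    return (granule // ways) * granularity + hpa_offset % granularity

# With 2 ways at 256B granularity, window bytes 0x200-0x2ff land at
# device bytes 0x100-0x1ff of target 0.
assert hpa_offset_to_dpa(0x200, 256, 2) == 0x100
assert hpa_offset_to_dpa(0x201, 256, 2) == 0x101
```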
Doing that via the CFMWS lets BIOS/Firmware simplify the management of the devices and forego the need for specific driver support (at the expense of flexibility of management after boot). In that case, you'd have the ACPI tables and firmware hardcode the interleave and simply present the larger pool as a single chunk of large capacity to the OS. Or maybe you might want to have some of them interleaved, and others managed by the host. Software defined memory is fuuuuun! :D The specification doesn't really have an opinion on how you "should" do all of this - it just provides at least 3 or 4 different ways to trim the chia pet and lets you be confused by the mess it has made. But as for the LSA and volatile regions, I still don't see a compelling reason for needing it to store prior settings. That seems more of a BIOS/EFI feature that needs to be programmed. ~Gregory (P.S. I was not and am not in any way responsible or involved with writing the spec, so I will now happily take my beatings should I have gotten any of this horribly wrong. This is all just my best understanding from having bashed my face against the spec the past 2 years or so). ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang @ 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 1 sibling, 0 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-05-30 10:54 UTC (permalink / raw) To: 'Gregory Price' Cc: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org, Jonathan Cameron, dave.jiang@intel.com, Fan Ni Hello Gregory-san, > > > > Q3, For CXL volatile memory devices without LSA installed, if > > > > users expect to restore the Interleave set to the previous > > > > configuration after reboot, the questions are: > > > > Q3.1 Where should the Interleave Set information be stored? > > > > Q3.2 Which component is responsible for restoring the Interleave Set? > > > > > > The expectation is that BIOS, or the OS for hotplug devices, deploys > > > a default region configuration policy. That policy in the common is > > > likely one of either maximizing performance (maximize interleave > > > across host-bridges), or maximizing error isolation (create an x1-interleave > region per endpoint). > > > > To be honest, I feel CFMWS seems to be something incomplete spec.. > > > > When I first saw the " CXL* Type 3 Memory Device Software Guide", and > > noticed existing CFMWS, I thought that the firmware would create it > > based on some configuration, and OS would read it and create region for each > window information. > > Even if user would execute cxl create-region command and configure > > interleaved region, I thought OS would tell it to firmware (or something), and > CFMWS would reflect it on the next boot. > > Ok this has just made me realize that I really do need to write that article on the > various forms of interleaving in a post-CXL world. 
> > Quoting some of the specification rq: > CXL 3.1 Section 9.18.1.3: CXL Fixed Memory Window Structure > > """ > The CFMWS structure describes zero or more Host Physical Address (HPA) > windows that are associated with each CXL Host Bridge. Each window > represents a contiguous HPA range that may be interleaved across one or more > targets, some of which are CXL Host Bridges. Associated with each window > are a set of restrictions that govern its usage. It is the OSPM's responsibility to > utilize each window for the specified use. > > The HPA ranges described by CFMWS may include addresses that are current > assigned to CXL.mem devices. Before assigning HPAs from a fixed-memory > window, the OSPM must check the current assignments and avoid any > conflicts. > > For any given HPA, it shall not be described by more than one CFMWS entry """ > > Dan, please correct me if I'm wrong, but I'm fairly certain the following is > accurate. > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration to > the Operating System, not the Operating System's mechanism to change > system configurations (such as interleave). What you're talking about is > re-configuring HDM Decoders to interleave devices *presented by* the > CFMWS to the operating system. > > Confusing, I know. But stick with me. > > The interleave referred to the CFMWS is the BIOS/EFI telling the system that > memory accesses to this (physicall address) region will be interleaved across > the set of devices that are backing that region. The operating system is > responsible for reading these settings and presenting the memory to the > system accordingly. > > The BIOS for example could configure all devices behind a single CFMW as a > "Single Device" that interleaves many physical devices, and the OS should > present it as such. In this scenario, there is no need to configure an interleave > region via cxl-cli - the BIOS already did that for you and presented all these > devices as a single device. 
All you need to do is online the memory. > > Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI > options that say how to configure a set of CXL devices behind one or more host > bridges prior to OS boot. This has its limitations. For example, you'd need to > reboot the system to make changes and hotplugging a memory device becomes > impossible. The BIOS/EFI would also need to understand when the prior > configuration is no longer valid - complicated and problematic. > > Additionally, for more dynamic environments (devices behind a switch, or a > DCD) this more "static" configuration may (read: does) reduce your > management flexibility. I.e. hotplug may not be possible. > > > Alternatively, the BIOS may configure each device separately, and the OS is may > create a region that interleaves those devices explicitly by programming an > HDM decoder. > > In this scenario, the OS could tear down the region, hotplug that device, and > recreate the region with new settings accordingly. Greater management > flexibility, but more software/management complexity. > > This requires the OS to recreate the region/interleave set on each reboot - and > is probably the preferred mechanism for configuring the system (if only > because hotplug and device failure is not uncommon). > > In this scenario, re-configuration looks a lot like storage mounting. > The device is either there or it isn't, and the configuration file either works or it > doesn't. Alternatively the daemon setting this all up is free to try to make > auto-configuration decisions. > > (Final note about interleave for completion sake, but not really relevant to this > discussion) > > Alternatively you could just online each device as a separate region, and simply > use something like set_mempolicy/numactl to implement interleave on a > per-task basis. > Thank you for your extremely detailed explanation, it was more helpful than I expected. 
> > > > > But, really is that the above scenario is only for persistent memory with LSA. > > Even if a user configures a new region for volatile memory, and I > > could not find any specification to tell the new configuration to the Firmware. > > > > Could you tell me why such interface is not defined in the CXL specification? > > Is it just because there is no place to store region information for volatile > memory? > > > > > > IMHO, users want to keep previous configuration after reboot even if it is > volatile memory. > > Though users don't concern about contents of volatile memory, they > > want to keep region/interleave configuration after reboot. Especially, > > if previous configuration is some years ago, I'll bet users will forget how they > configured regions against cxl volatile memory. > > > > Probably we want some daemon that reconfigures this similar to how we're > doing it with storage. You register a preferred configuration given the > hardware environment that is valid until the hardware changes. Currently I'm considering how a CXL memory/device pool should be managed for our future product, and I think that such a daemon will ideally need more sophisticated features for the memory/device pool: - (Re)configure regions for volatile memory as you said - Select DCD (if available), or whole memory device hotplug (when cxl2.0) for the memory pool - Select offline memory blocks/devices/DCD area (if a memory block offline fails, the daemon may need to set other memory blocks on the same device online again to return to the previous status and try to hot-remove another device (when cxl2.0?)) - Kick scripts for users' applications (or some other things) to prepare for memory removal (In my experience, a process needed to be sent a signal to hot-remove a device.) - Detect hot-added devices and make new regions (but it may need to wait for the next device if users want to configure interleave with the next device). 
- If necessary, it may need an API to talk to an orchestrator or some other pool management component. - error notification... - etc.... (I suppose there are many things I don't notice yet.) I think such a daemon will be essential for a memory/device pool, but it will require a lot of effort. On the other hand, I feel it may be too sophisticated for a simple server (for example, a server which has only direct-attached CXL memory). In such a case, a simple command like "cxl save region" and "cxl restore region" would be easier for users, I think... > > The OS shouldn't really be telling the firmware to configure itself if only > because what happens if you unplug a device? Anyway, thank you for your detailed explanation. Thanks, --- Yasunori Goto > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
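The proposed "cxl save region" / "cxl restore region" pair could behave like the following sketch: snapshot the topology alongside the region config, and refuse to restore (reporting what changed) when the hardware differs, as Goto-san suggests. The JSON shape and field names are invented, not any actual cxl-cli format:

```python
# Sketch of a save/restore pair for region configuration. The saved
# blob records which devices were present so a restore on changed
# hardware fails loudly instead of silently building a different region.
import json

def save(regions, topology):
    return json.dumps({"topology": sorted(topology), "regions": regions})

def restore(blob, current_topology):
    state = json.loads(blob)
    if state["topology"] != sorted(current_topology):
        added = set(current_topology) - set(state["topology"])
        removed = set(state["topology"]) - set(current_topology)
        raise RuntimeError(
            f"hardware changed: added {sorted(added)}, removed {sorted(removed)}")
    return state["regions"]

blob = save([{"region": "region0", "devices": ["mem0", "mem1"]}],
            ["mem0", "mem1"])
# Enumeration order does not matter, only the set of devices.
assert restore(blob, ["mem1", "mem0"])[0]["region"] == "region0"
```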
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price @ 2024-05-31 20:56 ` Dan Williams 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Dan Williams @ 2024-05-31 20:56 UTC (permalink / raw) To: Yasunori Gotou (Fujitsu), 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni Yasunori Gotou (Fujitsu) wrote: [..] > > What is currently missing on the Linux OS side is a default policy for unmapped > > volatile capacity after all initial device probing has completed. > > > > > One scenario I understand is that in clusters using an Orchestrator, > > > such as K8S, when a node (worker) restarts, K8S is able to read the > > > Interleave Set from the database and sets it for the corresponding node. > > > > Yes, for sophisticated environments a configuration database could store and > > replay region configurations each boot. > > Just an idea, I suppose that cxl command should have two features. > - save current regions configuration to a file which is specified by its operand. > - reconfigure regions depends on the specified file. > (If hardware condition is changed, the command return error and display what is changed.) > > Probably, it may be enough for the most of users.... But what do you think it? It is not clear that the tool needs a save / restore option vs just replaying the same create-region command from one boot to the next. You can imagine having a startup script that issues one of the following: 1/ cxl create-region 2/ cxl create-region -d decoder0.4 3/ cxl create-region -m mem0 mem1 4/ cxl create-region -S 0x12345 0xabcde 5/ cxl create-region -S 0x12345 0xabcde -s 1T 1/ Find the CXL window with the largest available capacity and create a maximally sized region. 
2/ Limit the search to a specific CXL window, but create the largest available region from any spare capacity found there. 3/ Try to create the largest possible region with mem0 and mem1. NOTE that this could produce wildly different results from boot to boot because CXL device scanning is asynchronous and mem0 from one boot may not match mem0 in the current boot even if the hardware configuration has not changed. 4/ Same as 3/ but guaranteed to get the exact same devices because they are addressed by serial number. 5/ Same as 4/ but limit the size. So, a lot can be done without needing to save/restore the exact configuration; for volatile memory the exact configuration does not matter as much as the capacity and performance characteristics. NOTE, some of the above examples are not implemented yet, patches welcome! For example the -S option to treat the arguments as memory-device serial numbers rather than memory-device id numbers is not there today, "cxl create-region" by itself does not know how to search for capacity, and "cxl create-region -m" without specifying the decoder/window throws an error even though it is relatively straightforward to figure out the set of CXL windows that map a given memdev. ^ permalink raw reply [flat|nested] 12+ messages in thread
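The proposed (not yet implemented) -S option boils down to resolving serial numbers to whatever memdev ids the kernel assigned this boot, so the same physical devices are selected regardless of enumeration order. A sketch, with the serial table standing in for reading each memdev's serial attribute from sysfs; all values are invented:

```python
# Sketch of serial-number-based device selection: map the requested
# serials to this boot's memdev names, failing if a serial is absent.

def resolve_serials(wanted_serials, memdev_serials):
    """memdev_serials: memdev-name -> serial number (int)."""
    by_serial = {serial: name for name, serial in memdev_serials.items()}
    try:
        return [by_serial[s] for s in wanted_serials]
    except KeyError as e:
        raise RuntimeError(f"no memdev with serial {e.args[0]:#x}") from None

# mem numbering flipped between boots; the serials still find the same
# physical devices, addressing Dan's point 3/ vs 4/ above.
assert resolve_serials([0x12345, 0xabcde],
                       {"mem0": 0xabcde, "mem1": 0x12345}) == ["mem1", "mem0"]
```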
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-31 20:56 ` Dan Williams @ 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu) 0 siblings, 0 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-06-03 5:01 UTC (permalink / raw) To: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni > Yasunori Gotou (Fujitsu) wrote: > [..] > > > What is currently missing on the Linux OS side is a default policy > > > for unmapped volatile capacity after all initial device probing has > completed. > > > > > > > One scenario I understand is that in clusters using an > > > > Orchestrator, such as K8S, when a node (worker) restarts, K8S is > > > > able to read the Interleave Set from the database and sets it for the > corresponding node. > > > > > > Yes, for sophisticated environments a configuration database could > > > store and replay region configurations each boot. > > > > Just an idea, I suppose that cxl command should have two features. > > - save current regions configuration to a file which is specified by its > operand. > > - reconfigure regions depends on the specified file. > > (If hardware condition is changed, the command return error and > > display what is changed.) > > > > Probably, it may be enough for the most of users.... But what do you think it? > > It is not clear that the tool needs a save / restore option vs just replaying the > same create-region command from one boot to the next. > > You can imagine having a startup script that issues one of the > following: > > 1/ cxl create-region > 2/ cxl create-region -d decoder0.4 > 3/ cxl create-region -m mem0 mem1 > 4/ cxl create-region -S 0x12345 0xabcde > 4/ cxl create-region -S 0x12345 0xabcde -s 1T > > 1/ Find the CXL window with the largest available capacity and create a > maximally sized region. 
> > 2/ Limit the search to a specific CXL window, but create the largest available > region from any spare capacity found there. > > 3/ Try to create the largest possible region with mem0 and mem1. NOTE that > this could produce wildly different results from boot to boot because CXL > device scanning is asynchronous and mem0 from one boot may not match > mem0 in the current boot even if the hardware configuration has not changed. > > 4/ Same as 3/ but guaranteed to get the exact same devices because they are > addressed by serial number. I feel that -S option may be interesting.... > > 5/ Same as 4/ but limit the size. > > So, a lot can be done without needing to save restore the exact configuration, > which for volatile the exact configuration does not matter as much as the the > capacity and performance characteristics. Thank you for sharing your idea. > > NOTE, some of the above examples are not implemented yet, patches > welcome! For example the -S option to treat the arguments as memory-device > serial numbers rather memory-device id numbers is not there today, "cxl > create-region" by itself does not know how to search for capacity, and "cxl > create-region -m" without specifying the decoder/window throws an error even > though it is relatively straightforward to figure out the set of CXL windows that > map a given memdev. Ok, I see. I'll think about it a bit more. Thanks, ---- Yasunori Goto ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-06-03 5:02 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-24 7:32 CXL volatile memory: How to restore the previous region/Interleave set Zhijian Li (Fujitsu) 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 15:44 ` Gregory Price 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang 2024-05-31 15:50 ` Gregory Price 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 2024-05-31 20:56 ` Dan Williams 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu)