* CXL volatile memory: How to restore the previous region/Interleave set @ 2024-05-24 7:32 Zhijian Li (Fujitsu) 2024-05-29 1:08 ` Dan Williams 0 siblings, 1 reply; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-24 7:32 UTC (permalink / raw) To: members@computeexpresslink.org, linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dan.j.williams@intel.com, dave.jiang@intel.com, Fan Ni Hey CXL and Linux-CXL communities, I am trying to understand how the current hardware and software can work together to restore the previous region/Interleave Set configuration for CXL volatile memory upon the next boot, but I don't have the answer yet. Therefore, I have several questions and hope you can provide some suggestions and thoughts. Thank you. Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 section 9.13.2, it seems that LSA applies to CXL memory (including volatile memory and persistent memory), but it does not explicitly state whether LSA is mandatory. My understanding is: - LSA is mandatory for persistent memory - LSA is optional for volatile memory Is this understanding correct? Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe the geometry of a persistent memory Interleave Set". What does "a persistent memory Interleave Set" mean here? - a persistent Interleave Set for CXL memory device (volatile and persistent) or - a persistent Interleave Set for CXL persistent memory device only. Q3, For CXL volatile memory devices without LSA installed, if users expect to restore the Interleave set to the previous configuration after reboot, the questions are: Q3.1 Where should the Interleave Set information be stored? Q3.2 Which component is responsible for restoring the Interleave Set? One scenario I understand is that in clusters using an Orchestrator, such as K8S, when a node (worker) restarts, K8S is able to read the Interleave Set from the database and set it for the corresponding node.
However, for a single-node machine, how can the kernel restore this information immediately after startup (before /init executes) without user intervention? Does the current Specification define this, or do other CXL-related firmware components, such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct the previous configuration? PS. It seems that the LSA Region Label has not been implemented yet in the Linux kernel. Region/Interleave Set state lives in running kernel memory and device registers, so it is lost after a reboot. Thanks Zhijian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-24 7:32 CXL volatile memory: How to restore the previous region/Interleave set Zhijian Li (Fujitsu) @ 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 0 siblings, 2 replies; 12+ messages in thread From: Dan Williams @ 2024-05-29 1:08 UTC (permalink / raw) To: Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dan.j.williams@intel.com, dave.jiang@intel.com, Fan Ni Hi Zhijian, I dropped members@computeexpresslink.org from this thread. If those folks are interested they can follow this discussion here: https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com Otherwise, the way the wider Linux community learns about consortium deliberations is through new published spec revisions. Zhijian Li (Fujitsu) wrote: > Hey CXL and Linux-CXL communities, > > I am trying to understand how the current hardware and software can work > together to restore the previous region/Interleave Set configuration for CXL > volatile memory upon the next boot, but I don't have the answer yet. > Therefore, I have several questions and hope you can provide some suggestions > and thoughts. Thank you. > > Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 > section 9.13.2, it seems that LSA applies to CXL memory (including volatile > memory and persistent memory), but it does not explicitly state whether LSA > is mandatory. My understanding is: > - LSA is mandatory for persistent memory > - LSA is optional for volatile memory > Is this understanding correct? I would say it differently. LSA is mandatory for persistent memory, and irrelevant for volatile memory. > Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe > the geometry of a persistent memory Interleave Set". What does "a persistent > memory Interleave Set" mean here? 
> - a persistent Interleave Set for CXL memory device (volatile and persistent) > or > - a persistent Interleave Set for CXL persistent memory device only. Persistent only, because persistence means getting the exact same content as was previously available. A volatile region configuration can be restored by recreating the region. > Q3, For CXL volatile memory devices without LSA installed, if users expect to > restore the Interleave set to the previous configuration after reboot, the > questions are: > Q3.1 Where should the Interleave Set information be stored? > Q3.2 Which component is responsible for restoring the Interleave Set? The expectation is that BIOS, or the OS for hotplug devices, deploys a default region configuration policy. That policy in the common case is likely one of either maximizing performance (maximize interleave across host-bridges), or maximizing error isolation (create an x1-interleave region per endpoint). What is currently missing on the Linux OS side is a default policy for unmapped volatile capacity after all initial device probing has completed. > One scenario I understand is that in clusters using an Orchestrator, such as > K8S, when a node (worker) restarts, K8S is able to read the Interleave Set > from the database and sets it for the corresponding node. Yes, for sophisticated environments a configuration database could store and replay region configurations each boot. > However, for a single-node machine, how can the kernel restore this information > immediately after startup (before /init executes) without user intervention? Linux needs a default policy. Likely the best place to do this is with a udev script that waits for PCI scanning to quiesce and then creates volatile regions. Unlike a PMEM region where the exact config needs to be replicated each boot, devices can be reordered and replaced over reboot without needing to rewrite labels on the device(s).
> Does the current Specification define this, or do other CXL-related firmware, > such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct > the previous configuration? There is already enough information for something like: cxl create-region --continue That would automatically create maximally sized regions until all available capacity is mapped. As it stands "cxl create-region" is not yet enlightened enough to do its own capacity search, but nothing more is needed from the specification side to enable that. > PS. > It seems that LSA Region Label has not implemented yet in linux kernel > Region/Interleave set are saved in the running kernel memory/device registers, will > get lost after reboot in linux kernel. For PMEM regions there is support for routing label updates from the nvdimm subsystem back to the label area of CXL-PMEM devices. The full end-to-end support for recovering PMEM regions from stored labels was deferred due to the lack of PMEM CXL devices in the market. So, similar to Type-2 support, it awaits an endpoint implementation to arrive. For volatile regions there is no expectation that the driver would ever need to consider labels because "recreate regions by policy" is possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
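Dan's two default policies — maximize interleave across host bridges for performance, or an x1 region per endpoint for error isolation — can be sketched as a pure planning step. Everything below (the memdev list shape, the field names) is an illustrative assumption, not the actual cxl-cli, sysfs, or kernel interface:

```python
# Sketch of the two default volatile-region policies described above.
# The memdev dicts are hypothetical; a real tool would build them from
# /sys/bus/cxl/devices or "cxl list -M" output.

def plan_volatile_regions(memdevs, policy="performance"):
    """Return a list of planned regions, each a list of memdev names."""
    if policy == "isolation":
        # error isolation: one x1-interleave region per endpoint
        return [[m["name"]] for m in memdevs]
    # performance: one region whose target order round-robins across
    # host bridges, so the interleave spans as many bridges as possible
    by_bridge = {}
    for m in memdevs:
        by_bridge.setdefault(m["host_bridge"], []).append(m["name"])
    region, queues = [], list(by_bridge.values())
    while any(queues):
        for q in queues:
            if q:
                region.append(q.pop(0))
    return [region]

devs = [
    {"name": "mem0", "host_bridge": "hb0"},
    {"name": "mem1", "host_bridge": "hb0"},
    {"name": "mem2", "host_bridge": "hb1"},
]
print(plan_volatile_regions(devs, "isolation"))    # [['mem0'], ['mem1'], ['mem2']]
print(plan_volatile_regions(devs, "performance"))  # [['mem0', 'mem2', 'mem1']]
```

A udev-triggered script, as suggested above, could feed the planned target lists to `cxl create-region` once PCI scanning has quiesced.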
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 1:08 ` Dan Williams @ 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 15:44 ` Gregory Price 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-29 10:19 UTC (permalink / raw) To: Dan Williams, linux-cxl@vger.kernel.org Cc: Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni Thanks Dan, On 29/05/2024 09:08, Dan Williams wrote: > Hi Zhijian, > > I dropped members@computeexpresslink.org from this thread. If those > folks are interested they can follow this discussion here: > > https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com > > Otherwise, the way the wider Linux community learns about consortium > deliberations is through new published spec revisions. Agreed. thank you. > > Zhijian Li (Fujitsu) wrote: >> Hey CXL and Linux-CXL communities, >> >> I am trying to understand how the current hardware and software can work >> together to restore the previous region/Interleave Set configuration for CXL >> volatile memory upon the next boot, but I don't have the answer yet. >> Therefore, I have several questions and hope you can provide some suggestions >> and thoughts. Thank you. >> >> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 >> section 9.13.2, it seems that LSA applies to CXL memory (including volatile >> memory and persistent memory), but it does not explicitly state whether LSA >> is mandatory. My understanding is: >> - LSA is mandatory for persistent memory >> - LSA is optional for volatile memory >> Is this understanding correct? > > I would say it differently. LSA is mandatory for persistent memory, and > irrelevant for volatile memory. 
Another reason for my above understanding is that the current QEMU documentation[1] also states this, and the actual code behaves accordingly (mandatory for persistent memory and optional for volatile memory). [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 Does anyone have a different opinion on this? I hope the CXL consortium can help confirm this point to members@computeexpresslink.org privately. (Currently, the original mail cannot reach members@computeexpresslink.org until it gets approval.) > >> Q2, Per CXL r3.0 "9.13.2.4 Region Labels", it mentions "Region labels describe >> the geometry of a persistent memory Interleave Set". What does "a persistent >> memory Interleave Set" mean here? >> - a persistent Interleave Set for CXL memory device (volatile and persistent) >> or >> - a persistent Interleave Set for CXL persistent memory device only. > > Peristent only, because persistence means getting the exact same content > as was previously available. A volatile region configuration can be > restored by recreating the region.> >> Q3, For CXL volatile memory devices without LSA installed, if users expect to >> restore the Interleave set to the previous configuration after reboot, the >> questions are: >> Q3.1 Where should the Interleave Set information be stored? >> Q3.2 Which component is responsible for restoring the Interleave Set? > > The expectation is that BIOS, or the OS for hotplug devices, deploys a > default region configuration policy. That policy in the common is likely > one of either maximizing performance (maximize interleave across > host-bridges), or maximizing error isolation (create an x1-interleave > region per endpoint). >> What is currently missing on the Linux OS side is a default policy for > unmapped volatile capacity after all initial device probing has > completed. I would say I can imagine the policy you mentioned.
*policy* would work in some cases, but for a customized region, a default/predefined *policy* is not sufficient. In my mind, a customized region could be constructed from some or all of the available memdevs (SLD/MLD/DCD) with a customized interleave set. So the previously used configuration, including the memdevs and the interleave set, should be stored in persistent storage so that the region can be restored correctly. If these configurations cannot be read from the device(s), the most likely fallback is for the host OS to try to reconstruct the region according to the settings in its config file (/etc/cxl/mem.config) > >> One scenario I understand is that in clusters using an Orchestrator, such as >> K8S, when a node (worker) restarts, K8S is able to read the Interleave Set >> from the database and sets it for the corresponding node. > > Yes, for sophisticated environments a configuration database could store > and replay region configurations each boot. > >> However, for a single-node machine, how can the kernel restore this information >> immediately after startup (before /init executes) without user intervention? > > Linux needs a default policy. Likely the best place to do this is with a > udev script that waits for PCI scanning to quiesce and then creates > volatile regions. > > Unlike a PMEM region where the exact config needs to be replicated each > boot, devices can be reordered and replaced over reboot without needing > to rewrite labels on the device(s). > >> Does the current Specification define this, or do other CXL-related firmware, >> such as UEFI/ACPI/CFMWS/Fabric Manager, have enough information to reconstruct >> the previous configuration? > > There is already enough information for something like: > > cxl create-region --continue > > That would automatically create maximally sized regions until all > available capacity is mapped.
As it stands "cxl create-region" is not > yet enlightened enough to do its own capacity search, but nothing more > is needed from the specification side to enable that. > >> PS. >> It seems that LSA Region Label has not implemented yet in linux kernel >> Region/Interleave set are saved in the running kernel memory/device registers, will >> get lost after reboot in linux kernel. > > For PMEM regions there is support for routing label updates from the > nvdimm subsystem back to the label area of CXL-PMEM devices. The full > end-to-end support for recovering PMEM regions from stored labels was > deferred due to the lack of PMEM CXL devices in the market. So, similar > to Type-2 support, it awaits an endpoint implementation to arrive. > Okay, understood. Thanks Zhijian > For volatile regions there is no expectation that the driver would ever > need to consider labels because "recreate regions by policy" is > possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
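The `--continue` behavior Dan describes is essentially a greedy capacity search. A rough model of that planning step, under assumed simplifications (plain byte counts instead of decoder-granularity-aligned sizes, power-of-2 ways only, no window restrictions — none of this reflects the actual cxl-cli implementation):

```python
# Hypothetical model of a "cxl create-region --continue" capacity
# search: repeatedly build the widest region the remaining unmapped
# capacity allows, until nothing is left.

def plan_continue(capacity, valid_ways=(1, 2, 4, 8)):
    """capacity: {memdev: unmapped_bytes}. Returns (targets, size) pairs."""
    remaining = dict(capacity)
    regions = []
    while True:
        avail = {n: b for n, b in remaining.items() if b > 0}
        if not avail:
            break
        # widest interleave the populated memdevs allow
        ways = max(w for w in valid_ways if w <= len(avail))
        # take the largest contributors; the region size is bounded by
        # its smallest member, so this keeps each region maximal
        targets = sorted(avail, key=avail.get, reverse=True)[:ways]
        per_dev = min(avail[t] for t in targets)
        regions.append((sorted(targets), ways * per_dev))
        for t in targets:
            remaining[t] -= per_dev
    return regions

print(plan_continue({"mem0": 256, "mem1": 256, "mem2": 128}))
# [(['mem0', 'mem1'], 512), (['mem2'], 128)]
```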
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 10:19 ` Zhijian Li (Fujitsu) @ 2024-05-29 15:44 ` Gregory Price 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 0 siblings, 1 reply; 12+ messages in thread From: Gregory Price @ 2024-05-29 15:44 UTC (permalink / raw) To: Zhijian Li (Fujitsu) Cc: Dan Williams, linux-cxl@vger.kernel.org, Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni On Wed, May 29, 2024 at 10:19:21AM +0000, Zhijian Li (Fujitsu) wrote: > Thanks Dan, > > > On 29/05/2024 09:08, Dan Williams wrote: > > Hi Zhijian, > > > > I dropped members@computeexpresslink.org from this thread. If those > > folks are interested they can follow this discussion here: > > > > https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com > > > > Otherwise, the way the wider Linux community learns about consortium > > deliberations is through new published spec revisions. > > Agreed. thank you. > > > > > > Zhijian Li (Fujitsu) wrote: > >> Hey CXL and Linux-CXL communities, > >> > >> I am trying to understand how the current hardware and software can work > >> together to restore the previous region/Interleave Set configuration for CXL > >> volatile memory upon the next boot, but I don't have the answer yet. > >> Therefore, I have several questions and hope you can provide some suggestions > >> and thoughts. Thank you. > >> > >> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 > >> section 9.13.2, it seems that LSA applies to CXL memory (including volatile > >> memory and persistent memory), but it does not explicitly state whether LSA > >> is mandatory. My understanding is: > >> - LSA is mandatory for persistent memory > >> - LSA is optional for volatile memory > >> Is this understanding correct? > > > > I would say it differently. LSA is mandatory for persistent memory, and > > irrelevant for volatile memory. 
> > Another reason for my above understanding is that the current QEMU > documentation[1] also states this, and the actual code behaves accordingly > (mandatory for persistent memory and optional for volatile memory). > > [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 > > Anyone have other opinion on this? > > Hope the CXL consortium can help on confirming on this point to > members@computeexpresslink.org privately. > (Currently the original mail cannot reach members@computeexpresslink.org > until getting its approval) > Just to be concrete: CXL Spec 3.1: 8.2.9.9.2.3 "The Label Storage Area (LSA) *shall be* supported by a memory device that provides persistent memory capacity and *may be* supported by a device that provides only volatile memory capacity" For persistent: Required For volatile: Optional What Dan is saying is that a volatile-only device with an LSA is possible, but the LSA isn't particularly novel or useful since the data in the volatile region is destroyed when the device is reset. Recording things like interleave set configurations in LSA doesn't really make sense, given that devices probably shouldn't know about each other, and an interleave set kind of implies knowledge of other devices. So as Dan said: An LSA is irrelevant for a volatile device. Anything you'd use it for is probably better accomplished some other way.
*policy* would work > on some cases, but > For a customized region, a default/predefined *policy* is not sufficient. > In my mind, a customized region could be constructed with part of all > the available memdevs(SLD/MLD/DCD), and customized interleave set. > > So the the previous used configurations including memdevs and interleave set > should be stored in a persistent storage that the region can be restored > correctly. > There's any number of reasons this is a bad idea, the most obvious of which is that you're recording information about a set of devices on each device. What if some of those devices go away (or are upgraded, change, whatever) and you need to do something different? What if you move that device to another machine? The state machine you create with this setup is pretty awful. I suspect in answering these questions, you end up just resolving to save the configuration to disk instead of the LSAs - which (as Dan said) makes the LSA irrelevant. Really what you want is a smarter daemon that detects the topology and implements the best policy given the current environment. ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
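Gregory's objection — devices disappearing, being upgraded, or moving between machines — is really a validation problem for any save-and-replay tool. A minimal sketch of the check such a tool would need before reapplying a saved configuration (the serial-number-keyed format and field names are invented for illustration):

```python
# Validate a saved region description against the memdevs actually
# present before replaying it. Both data shapes are hypothetical.

def check_saved_region(saved, present_memdevs):
    """Return a list of mismatches; an empty list means the saved
    region can be replayed as-is."""
    present = {d["serial"]: d for d in present_memdevs}
    problems = []
    for t in saved["targets"]:
        dev = present.get(t["serial"])
        if dev is None:
            problems.append(f"missing device: serial {t['serial']}")
        elif dev["volatile_bytes"] < t["bytes"]:
            problems.append(f"device {t['serial']} has less capacity than saved")
    return problems

saved = {"targets": [{"serial": "A", "bytes": 100}, {"serial": "B", "bytes": 100}]}
print(check_saved_region(saved, [{"serial": "A", "volatile_bytes": 100}]))
# ['missing device: serial B']
```

On any mismatch the tool would fall back to a default policy (or ask the administrator) rather than replay blindly — essentially the "smarter daemon" behavior Gregory describes.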
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 15:44 ` Gregory Price @ 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 0 siblings, 0 replies; 12+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-05-30 9:56 UTC (permalink / raw) To: Gregory Price Cc: Dan Williams, linux-cxl@vger.kernel.org, Yasunori Gotou (Fujitsu), Jonathan Cameron, dave.jiang@intel.com, Fan Ni On 29/05/2024 23:44, Gregory Price wrote: > On Wed, May 29, 2024 at 10:19:21AM +0000, Zhijian Li (Fujitsu) wrote: >> Thanks Dan, >> >> >> On 29/05/2024 09:08, Dan Williams wrote: >>> Hi Zhijian, >>> >>> I dropped members@computeexpresslink.org from this thread. If those >>> folks are interested they can follow this discussion here: >>> >>> https://lore.kernel.org/r/36106fcf-1062-4961-8918-4471fd313a74@fujitsu.com >>> >>> Otherwise, the way the wider Linux community learns about consortium >>> deliberations is through new published spec revisions. >> >> Agreed. thank you. >> >> >>> >>> Zhijian Li (Fujitsu) wrote: >>>> Hey CXL and Linux-CXL communities, >>>> >>>> I am trying to understand how the current hardware and software can work >>>> together to restore the previous region/Interleave Set configuration for CXL >>>> volatile memory upon the next boot, but I don't have the answer yet. >>>> Therefore, I have several questions and hope you can provide some suggestions >>>> and thoughts. Thank you. >>>> >>>> Q1, First, I would like to ask about the scope of LSA. According to CXL r3.0 >>>> section 9.13.2, it seems that LSA applies to CXL memory (including volatile >>>> memory and persistent memory), but it does not explicitly state whether LSA >>>> is mandatory. My understanding is: >>>> - LSA is mandatory for persistent memory >>>> - LSA is optional for volatile memory >>>> Is this understanding correct? >>> >>> I would say it differently. LSA is mandatory for persistent memory, and >>> irrelevant for volatile memory. 
>> >> >> Another reason for my above understanding is that the current QEMU >> documentation[1] also states this, and the actual code behaves accordingly >> (mandatory for persistent memory and optional for volatile memory). >> >> [1] https://github.com/qemu/qemu/blob/79d7475f39f1b0f05fcb159f5cdcbf162340dc7e/docs/system/devices/cxl.rst?plain=1#L324 >> >> Anyone have other opinion on this? >> >> Hope the CXL consortium can help on confirming on this point to >> members@computeexpresslink.org privately. >> (Currently the original mail cannot reach members@computeexpresslink.org >> until getting its approval) >> > > Just to be concrete: > > CXL Spec 3.1: 8.2.9.9.2.3 > > "The Label Storage Area (LSA) *shall be* supported by a memory device > that provides persistent memory capacity and *may be* supported by a > device that provides only volatile memory capacity" Thanks for the reference, I missed it before. > > For persistent: Required > For volatile: Optional > > What Dan is saying is that an voltile-only device with an LSA is > possible, but the LSA isn't particularly novel or useful since the > data in the voltile region is destroyed when the device is reset. > Well, I misunderstood this in the previous reply. Many thanks for your explanation on this point. > Recording things like interleave set configurations in LSA doesn't > really make sense, given that devices probably shouldn't know about each > other, and an interleave set kind of implies knowledge of other devices. > > So as Dan said: An LSA is irrelevant for a voltile device. Anything > you'd use it for is probably better accomplished some other way.> >>> >>> The expectation is that BIOS, or the OS for hotplug devices, deploys a >>> default region configuration policy. That policy in the common is likely >>> one of either maximizing performance (maximize interleave across >>> host-bridges), or maximizing error isolation (create an x1-interleave >>> region per endpoint). 
>>>> What is currently missing on the Linux OS side is a default policy for >>> unmapped volatile capacity after all initial device probing has >>> completed. >> >> I would say I can imagine the policy you mentioned. *policy* would work >> on some cases, but >> For a customized region, a default/predefined *policy* is not sufficient. >> In my mind, a customized region could be constructed with part of all >> the available memdevs(SLD/MLD/DCD), and customized interleave set. >> >> So the the previous used configurations including memdevs and interleave set >> should be stored in a persistent storage that the region can be restored >> correctly. >> > > There's any number of reasons this is a bad idea, the most obvious of > which is that your recording information about a set of devices on each > device. What if some of those devices go away (or are upgraded, change, > whatever) and you need to do something different? What if you move that > device to another machine? The state machine you create with this setup > is pretty awful. I suspect in answering these questions, you end up > just resolving to save the configuration to disk instead of the LSAs - > which (as Dan said) makes the LSA irrelevant. > In my understanding, the LSA Region Label should include the information needed to restore the previous interleave set (across memdevs). Of course, this information in the LSA differs from what is saved on disk. > Really what you want is a smarter daemon that detects the topology and > implements the best policy given the current environment. Anyway, I think I have gotten the answer from your and Dan's replies. Thanks again, Zhijian > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) @ 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price 2024-05-31 20:56 ` Dan Williams 1 sibling, 2 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-05-29 11:33 UTC (permalink / raw) To: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni Hi Dan-san, > > Q3, For CXL volatile memory devices without LSA installed, if users > > expect to restore the Interleave set to the previous configuration > > after reboot, the questions are: > > Q3.1 Where should the Interleave Set information be stored? > > Q3.2 Which component is responsible for restoring the Interleave Set? > > The expectation is that BIOS, or the OS for hotplug devices, deploys a default > region configuration policy. That policy in the common is likely one of either > maximizing performance (maximize interleave across host-bridges), or > maximizing error isolation (create an x1-interleave region per endpoint). To be honest, I feel the CFMWS is a somewhat incomplete part of the spec. When I first saw the "CXL* Type 3 Memory Device Software Guide" and noticed the existing CFMWS, I thought that the firmware would create it based on some configuration, and the OS would read it and create a region for each window it describes. Even if a user executed the cxl create-region command and configured an interleaved region, I thought the OS would tell the firmware (or something) about it, and the CFMWS would reflect it on the next boot. But the reality is that the above scenario applies only to persistent memory with an LSA. Even when a user configures a new region for volatile memory, I could not find any specification for telling the new configuration to the firmware. Could you tell me why such an interface is not defined in the CXL specification?
Is it just because there is no place to store region information for volatile memory? IMHO, users want to keep the previous configuration after a reboot even for volatile memory. Though users aren't concerned about the contents of volatile memory, they want to keep the region/interleave configuration across reboots. Especially if the previous configuration was made years ago, I'll bet users will have forgotten how they configured their CXL volatile memory regions. > > What is currently missing on the Linux OS side is a default policy for unmapped > volatile capacity after all initial device probing has completed. > > > One scenario I understand is that in clusters using an Orchestrator, > > such as K8S, when a node (worker) restarts, K8S is able to read the > > Interleave Set from the database and sets it for the corresponding node. > > Yes, for sophisticated environments a configuration database could store and > replay region configurations each boot. Just an idea, but I suppose the cxl command should have two features: - save the current region configuration to a file specified by its operand. - reconfigure regions based on the specified file. (If the hardware configuration has changed, the command returns an error and displays what changed.) That would probably be enough for most users... But what do you think? Thanks, > > > However, for a single-node machine, how can the kernel restore this > > information immediately after startup (before /init executes) without user > intervention? > > Linux needs a default policy. Likely the best place to do this is with a udev > script that waits for PCI scanning to quiesce and then creates volatile regions. > > Unlike a PMEM region where the exact config needs to be replicated each boot, > devices can be reordered and replaced over reboot without needing to rewrite > labels on the device(s).
> > > Does the current Specification define this, or do other CXL-related > > firmware, such as UEFI/ACPI/CFMWS/Fabric Manager, have enough > > information to reconstruct the previous configuration? > > There is already enough information for something like: > > cxl create-region --continue > > That would automatically create maximally sized regions until all available > capacity is mapped. As it stands "cxl create-region" is not yet enlightened > enough to do its own capacity search, but nothing more is needed from the > specification side to enable that. > > > PS. > > It seems that LSA Region Label has not implemented yet in linux kernel > > Region/Interleave set are saved in the running kernel memory/device > > registers, will get lost after reboot in linux kernel. > > For PMEM regions there is support for routing label updates from the nvdimm > subsystem back to the label area of CXL-PMEM devices. The full end-to-end > support for recovering PMEM regions from stored labels was deferred due to > the lack of PMEM CXL devices in the market. So, similar to Type-2 support, it > awaits an endpoint implementation to arrive. > > For volatile regions there is no expectation that the driver would ever need to > consider labels because "recreate regions by policy" is possible. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) @ 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 2024-05-31 20:56 ` Dan Williams 1 sibling, 2 replies; 12+ messages in thread From: Gregory Price @ 2024-05-29 16:40 UTC (permalink / raw) To: Yasunori Gotou (Fujitsu) Cc: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org, Jonathan Cameron, dave.jiang@intel.com, Fan Ni On Wed, May 29, 2024 at 11:33:46AM +0000, Yasunori Gotou (Fujitsu) wrote: > Hi Dan-san, > > > > Q3, For CXL volatile memory devices without LSA installed, if users > > > expect to restore the Interleave set to the previous configuration > > > after reboot, the questions are: > > > Q3.1 Where should the Interleave Set information be stored? > > > Q3.2 Which component is responsible for restoring the Interleave Set? > > > > The expectation is that BIOS, or the OS for hotplug devices, deploys a default > > region configuration policy. That policy in the common is likely one of either > > maximizing performance (maximize interleave across host-bridges), or > > maximizing error isolation (create an x1-interleave region per endpoint). > > To be honest, I feel CFMWS seems to be something incomplete spec.. > > When I first saw the " CXL* Type 3 Memory Device Software Guide", and noticed existing > CFMWS, I thought that the firmware would create it based on some configuration, > and OS would read it and create region for each window information. > Even if user would execute cxl create-region command and configure interleaved region, > I thought OS would tell it to firmware (or something), and CFMWS would reflect it on the next boot. Ok this has just made me realize that I really do need to write that article on the various forms of interleaving in a post-CXL world. 
Quoting some of the specification: CXL 3.1 Section 9.18.1.3: CXL Fixed Memory Window Structure """ The CFMWS structure describes zero or more Host Physical Address (HPA) windows that are associated with each CXL Host Bridge. Each window represents a contiguous HPA range that may be interleaved across one or more targets, some of which are CXL Host Bridges. Associated with each window are a set of restrictions that govern its usage. It is the OSPM's responsibility to utilize each window for the specified use. The HPA ranges described by CFMWS may include addresses that are currently assigned to CXL.mem devices. Before assigning HPAs from a fixed-memory window, the OSPM must check the current assignments and avoid any conflicts. For any given HPA, it shall not be described by more than one CFMWS entry """ Dan, please correct me if I'm wrong, but I'm fairly certain the following is accurate. The CFMWS is the BIOS/EFI's mechanism to report the system configuration to the Operating System, not the Operating System's mechanism to change system configurations (such as interleave). What you're talking about is re-configuring HDM Decoders to interleave devices *presented by* the CFMWS to the operating system. Confusing, I know. But stick with me. The interleave referred to in the CFMWS is the BIOS/EFI telling the system that memory accesses to this (physical address) region will be interleaved across the set of devices that are backing that region. The operating system is responsible for reading these settings and presenting the memory to the system accordingly. The BIOS for example could configure all devices behind a single CFMW as a "Single Device" that interleaves many physical devices, and the OS should present it as such. In this scenario, there is no need to configure an interleave region via cxl-cli - the BIOS already did that for you and presented all these devices as a single device. All you need to do is online the memory. 
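The CFMWS rule quoted above — no HPA shall be described by more than one entry — amounts to an overlap check over the windows' HPA ranges. A toy model (entry layout and values are invented, not the real CEDT binary format):

```python
# Toy model of the quoted CFMWS rule: for any given HPA, it shall not be
# described by more than one CFMWS entry. Entries are (base, size)
# tuples in bytes; the addresses are invented for illustration.

def cfmws_overlaps(entries):
    """Return True if any two HPA windows overlap."""
    spans = sorted((base, base + size) for base, size in entries)
    # After sorting by base, an overlap exists iff some window starts
    # before the previous one ends.
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))

# Two adjacent 16GB windows: legal.
assert not cfmws_overlaps([(0x1000000000, 0x400000000),
                           (0x1400000000, 0x400000000)])
# Second window starts inside the first: violates the rule.
assert cfmws_overlaps([(0x1000000000, 0x400000000),
                       (0x1200000000, 0x400000000)])
```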
Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI options that say how to configure a set of CXL devices behind one or more host bridges prior to OS boot. This has its limitations. For example, you'd need to reboot the system to make changes and hotplugging a memory device becomes impossible. The BIOS/EFI would also need to understand when the prior configuration is no longer valid - complicated and problematic. Additionally, for more dynamic environments (devices behind a switch, or a DCD) this more "static" configuration may (read: does) reduce your management flexibility. I.e. hotplug may not be possible. Alternatively, the BIOS may configure each device separately, and the OS may create a region that interleaves those devices explicitly by programming an HDM decoder. In this scenario, the OS could tear down the region, hotplug that device, and recreate the region with new settings accordingly. Greater management flexibility, but more software/management complexity. This requires the OS to recreate the region/interleave set on each reboot - and is probably the preferred mechanism for configuring the system (if only because hotplug and device failure are not uncommon). In this scenario, re-configuration looks a lot like storage mounting. The device is either there or it isn't, and the configuration file either works or it doesn't. Alternatively the daemon setting this all up is free to try to make auto-configuration decisions. (Final note about interleave for completeness' sake, but not really relevant to this discussion) Alternatively you could just online each device as a separate region, and simply use something like set_mempolicy/numactl to implement interleave on a per-task basis. > > But, really is that the above scenario is only for persistent memory with LSA. > Even if a user configures a new region for volatile memory, and I could not find any specification to > tell the new configuration to the Firmware. 
> > Could you tell me why such interface is not defined in the CXL specification? > Is it just because there is no place to store region information for volatile memory? > > > IMHO, users want to keep previous configuration after reboot even if it is volatile memory. > Though users don't concern about contents of volatile memory, they want to keep region/interleave > configuration after reboot. Especially, if previous configuration is some years ago, I'll bet > users will forget how they configured regions against cxl volatile memory. > Probably we want some daemon that reconfigures this similar to how we're doing it with storage. You register a preferred configuration given the hardware environment that is valid until the hardware changes. The OS shouldn't really be telling the firmware to configure itself if only because what happens if you unplug a device? ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
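Gregory's "storage mounting" analogy can be sketched as a daemon that replays a declarative region config and reports missing devices the way a failed mount would. The config shape and device names below are invented for illustration; a real daemon would enumerate devices from sysfs and shell out to cxl-cli:

```python
# Sketch of a region-replay daemon: read a declarative config, emit the
# create-region commands to replay, and flag entries whose devices are
# absent - analogous to an fstab entry whose disk is missing.

def plan_regions(wanted, present):
    """wanted: list of {"region": name, "devices": [...]}.
    present: set of device names currently enumerated.
    Returns (commands_to_run, missing_entries)."""
    commands, missing = [], []
    for entry in wanted:
        absent = [d for d in entry["devices"] if d not in present]
        if absent:
            # Like a failed mount: report it rather than guess.
            missing.append((entry["region"], absent))
        else:
            commands.append("cxl create-region -m " + " ".join(entry["devices"]))
    return commands, missing
```

This also matches Gregory's later point: the configuration is "valid until the hardware changes", and the daemon's job is to notice when it no longer is.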
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 16:40 ` Gregory Price @ 2024-05-30 10:35 ` Yuquan Wang 2024-05-31 15:50 ` Gregory Price 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Yuquan Wang @ 2024-05-30 10:35 UTC (permalink / raw) To: Gregory Price Cc: lizhijian, dan.j.williams, linux-cxl, y-goto, Jonathan.Cameron, dave.jiang, fan.ni On Wed, May 29, 2024 at 12:40:41PM -0400, Gregory Price wrote: > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration > to the Operating System, not the Operating System's mechanism to change > system configurations (such as interleave). What you're talking about > is re-configuring HDM Decoders to interleave devices *presented by* the > CFMWS to the operating system. > > Confusing, I know. But stick with me. > > > > The interleave referred to the CFMWS is the BIOS/EFI telling the system > that memory accesses to this (physicall address) region will be interleaved > across the set of devices that are backing that region. The operating system > is responsible for reading these settings and presenting the memory to the > system accordingly. > > The BIOS for example could configure all devices behind a single CFMW as > a "Single Device" that interleaves many physical devices, and the OS should > present it as such. In this scenario, there is no need to configure an > interleave region via cxl-cli - the BIOS already did that for you and > presented all these devices as a single device. All you need to do is > online the memory. > Sorry Gregory, here I have a question. According to your description, the BIOS could prepare some interleaved CXL region configurations on the default CXL hardware (SoC), just like we do with the ndctl/cxl tools at OS run time (cxl create-region). 
> Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI > options that say how to configure a set of CXL devices behind one or more > host bridges prior to OS boot. This has its limitations. For example, you'd > need to reboot the system to make changes and hotplugging a memory device > becomes impossible. The BIOS/EFI would also need to understand when the > prior configuration is no longer valid - complicated and problematic. > > Additionally, for more dynamic environments (devices behind a switch, > or a DCD) this more "static" configuration may (read: does) reduce your > management flexibility. I.e. hotplug may not be possible. > > > > Alternatively, the BIOS may configure each device separately, and the > OS is may create a region that interleaves those devices explicitly by > programming an HDM decoder. > > In this scenario, the OS could tear down the region, hotplug that device, > and recreate the region with new settings accordingly. Greater > management flexibility, but more software/management complexity. > > This requires the OS to recreate the region/interleave set on each > reboot - and is probably the preferred mechanism for configuring the > system (if only because hotplug and device failure is not uncommon). > > In this scenario, re-configuration looks a lot like storage mounting. > The device is either there or it isn't, and the configuration file > either works or it doesn't. Alternatively the daemon setting this all > up is free to try to make auto-configuration decisions. > > > > > (Final note about interleave for completion sake, but not really > relevant to this discussion) > > Alternatively you could just online each device as a separate region, > and simply use something like set_mempolicy/numactl to implement > interleave on a per-task basis. > > > > > > But, really is that the above scenario is only for persistent memory with LSA. 
> > Even if a user configures a new region for volatile memory, and I could not find any specification to > > tell the new configuration to the Firmware. > > > > Could you tell me why such interface is not defined in the CXL specification? > > Is it just because there is no place to store region information for volatile memory? > > > > > > IMHO, users want to keep previous configuration after reboot even if it is volatile memory. > > Though users don't concern about contents of volatile memory, they want to keep region/interleave > > configuration after reboot. Especially, if previous configuration is some years ago, I'll bet > > users will forget how they configured regions against cxl volatile memory. > > > > Probably we want some daemon that reconfigures this similar to how we're > doing it with storage. You register a preferred configuration given the > hardware environment that is valid until the hardware changes. > > The OS shouldn't really be telling the firmware to configure itself if > only because what happens if you unplug a device? > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-30 10:35 ` Yuquan Wang @ 2024-05-31 15:50 ` Gregory Price 0 siblings, 0 replies; 12+ messages in thread From: Gregory Price @ 2024-05-31 15:50 UTC (permalink / raw) To: Yuquan Wang Cc: lizhijian, dan.j.williams, linux-cxl, y-goto, Jonathan.Cameron, dave.jiang, fan.ni On Thu, May 30, 2024 at 06:35:10PM +0800, Yuquan Wang wrote: > On Wed, May 29, 2024 at 12:40:41PM -0400, Gregory Price wrote: > > > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration > > to the Operating System, not the Operating System's mechanism to change > > system configurations (such as interleave). What you're talking about > > is re-configuring HDM Decoders to interleave devices *presented by* the > > CFMWS to the operating system. > > > > Confusing, I know. But stick with me. > > > > > > > > The interleave referred to the CFMWS is the BIOS/EFI telling the system > > that memory accesses to this (physicall address) region will be interleaved > > across the set of devices that are backing that region. The operating system > > is responsible for reading these settings and presenting the memory to the > > system accordingly. > > > > The BIOS for example could configure all devices behind a single CFMW as > > a "Single Device" that interleaves many physical devices, and the OS should > > present it as such. In this scenario, there is no need to configure an > > interleave region via cxl-cli - the BIOS already did that for you and > > presented all these devices as a single device. All you need to do is > > online the memory. > > > > Sorry Gregory, here I have a question. According to your description, the > bios drivers could prepare some interleave cxl region configurations on > default cxl hardware(SoC) just like we using ndctl-tools in OS run-time > (cxl create-region). 
> Not in the sense of using cxl-cli or ndctl, but in the sense that BIOS/EFI is responsible for reading hardware configurations and presenting a sane configuration/memory map to the operating system. It is technically possible, though not necessarily implemented anywhere, for BIOS to read the ACPI information from the devices and program the root complex/decoders/whathaveyou to present those devices as a single device to the operating system. The BIOS reads in the ACPI0016 data, generates one or more CFMWS entries and hands off management of that CFMWS to the OS. In doing so, it's perfectly capable of programming the CFMWS to present multiple devices (or even specific regions in those devices) as part of a single CFMW. This would look like reporting a single CFMWS covering multiple discrete physical memory devices. This CFMWS would have interleave ways set to >=2 and a TargetList with multiple discrete devices, with a single hardware physical address region that applies to both. The operating system would then manage this region as a single device. Looking briefly at the CXL* Type 3 Memory Device Software Guide from Intel (July 2021, Rev 1.0), this is described in section 2.6 and seems reasonably straightforward to me. You certainly COULD save this setup in the LSA if you wanted to, but to put it bluntly - there's now a better way of doing/managing all of this. HDM decoders let the OS set this all up. And really the LSA is meant to store information about how to stitch persistent data back together. This is probably why the LSA is not referenced for the volatile setups in the Software Guide. The LSA in the persistent setups is needed to ensure the data is put back together correctly (you could pull out the devices and swap the slots they're in, for example). This doesn't matter for volatile devices, so the programming can be decided on the fly. 
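The "interleave ways >= 2 plus a TargetList" arrangement described above can be illustrated with the basic target-selection math: the HPA bits just above the interleave granularity pick the target. This is a simplified sketch assuming power-of-two ways and no XOR-based decode (real decoders also support XOR math and 3/6/12-way modes); target names are invented:

```python
# Simplified illustration of how an interleaved window routes an HPA to
# one of its targets: which granule the offset falls in, modulo the
# number of ways, indexes the TargetList.

def select_target(hpa_offset, granularity, ways, targets):
    """hpa_offset: byte offset into the window; granularity in bytes."""
    index = (hpa_offset // granularity) % ways
    return targets[index]

targets = ["hb0", "hb1"]                      # hypothetical host bridges
assert select_target(0x000, 256, 2, targets) == "hb0"
assert select_target(0x100, 256, 2, targets) == "hb1"
assert select_target(0x200, 256, 2, targets) == "hb0"
```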
By my read - there's somewhat of an implied "We expect your hardware environment won't change much, so a couple BIOS/EFI flags could be set and forgotten about when setting up hardware interleave" not written in this document. Side note: I believe Intel did something similar (but different!) recently where they were presenting DRAM+CXL as a single NUMA node as a function of BIOS programming. I don't know whether this was done via the CFMWS or some other tomfoolery, but it's a similar concept. (The following I'm still a little fuzzy on, but this is my best understanding of how we got to where we are. If someone sees inaccuracies, please slap my wrist and tell me to stfu) HDM decoders provide the OS the capability to decide how to route host physical addresses down to the devices with the ability to program the root complex/host bridges, switches, and the devices to configure hardware interleave after boot. In this scenario, BIOS/EFI would report a single CFMWS to the OS for each discrete piece of hardware, and the Operating System would then program the HDM decoders on the host bridge(s)/switch and the devices to implement the interleave in hardware. This is the `cxl create-region ... ways=X devN devM ...` command. In some ways you can think of the CFMWS way of interleave as a kind of... "Legacy Pattern", because probably just about everyone will eventually want to use the HDM pattern because it will be capable of supporting things like hotplug in a more maintainable manner (or at all). For example - it's harder (if even possible) to tear down a CFMWS-implemented interleave pattern without rebooting the system than it is to tear down an HDM-implemented interleave pattern. You might, however, want to use a combination of these two strategies. If, for example, you have 8 expanders behind a switch attached to a single host bridge, you might want to treat that as a single, concrete device - as opposed to 8 separate expanders which the OS has to manage. 
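The HDM-decoder side of the math above is the reverse mapping: each target device sees a device physical address (DPA) with the interleave bits removed, since it holds every ways-th granule. Again a simplified sketch (power-of-two ways, no XOR math), not the literal decoder register programming:

```python
# Companion sketch: translating a window offset to the DPA the selected
# target sees. Each device stores one granule out of every `ways`
# granules, so the interleave bits collapse out of the address.

def hpa_offset_to_dpa(hpa_offset, granularity, ways):
    granule = hpa_offset // granularity          # which granule overall
    # The device holds granule // ways on its side, plus the offset
    # within the granule.
    return (granule // ways) * granularity + hpa_offset % granularity

# With 2 ways at 256B granularity, window bytes 0x200-0x2ff land at
# device bytes 0x100-0x1ff of target 0.
assert hpa_offset_to_dpa(0x200, 256, 2) == 0x100
assert hpa_offset_to_dpa(0x201, 256, 2) == 0x101
```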
Doing that via the CFMWS lets BIOS/Firmware simplify the management of the devices and forego the need for specific driver support (at the expense of flexibility of management after boot). In that case, you'd have the ACPI tables and firmware hardcode the interleave and simply present the larger pool as a single chunk of large capacity to the OS. Or maybe you might want to have some of them interleaved, and others managed by the host. Software defined memory is fuuuuun! :D The specification doesn't really have an opinion on how you "should" do all of this - it just provides at least 3 or 4 different ways to trim the chia pet and lets you be confused by the mess it has made. But as for the LSA and volatile regions, I still don't see a compelling reason for needing it to store prior settings. That seems more of a BIOS/EFI feature that needs to be programmed. ~Gregory (P.S. I was not and am not in any way responsible or involved with writing the spec, so I will now happily take my beatings should I have gotten any of this horribly wrong. This is all just my best understanding from having bashed my face against the spec the past 2 years or so). ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang @ 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 1 sibling, 0 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-05-30 10:54 UTC (permalink / raw) To: 'Gregory Price' Cc: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org, Jonathan Cameron, dave.jiang@intel.com, Fan Ni Hello Gregory-san, > > > > Q3, For CXL volatile memory devices without LSA installed, if > > > > users expect to restore the Interleave set to the previous > > > > configuration after reboot, the questions are: > > > > Q3.1 Where should the Interleave Set information be stored? > > > > Q3.2 Which component is responsible for restoring the Interleave Set? > > > > > > The expectation is that BIOS, or the OS for hotplug devices, deploys > > > a default region configuration policy. That policy in the common is > > > likely one of either maximizing performance (maximize interleave > > > across host-bridges), or maximizing error isolation (create an x1-interleave > region per endpoint). > > > > To be honest, I feel CFMWS seems to be something incomplete spec.. > > > > When I first saw the " CXL* Type 3 Memory Device Software Guide", and > > noticed existing CFMWS, I thought that the firmware would create it > > based on some configuration, and OS would read it and create region for each > window information. > > Even if user would execute cxl create-region command and configure > > interleaved region, I thought OS would tell it to firmware (or something), and > CFMWS would reflect it on the next boot. > > Ok this has just made me realize that I really do need to write that article on the > various forms of interleaving in a post-CXL world. 
> > Quoting some of the specification rq: > CXL 3.1 Section 9.18.1.3: CXL Fixed Memory Window Structure > > """ > The CFMWS structure describes zero or more Host Physical Address (HPA) > windows that are associated with each CXL Host Bridge. Each window > represents a contiguous HPA range that may be interleaved across one or more > targets, some of which are CXL Host Bridges. Associated with each window > are a set of restrictions that govern its usage. It is the OSPM's responsibility to > utilize each window for the specified use. > > The HPA ranges described by CFMWS may include addresses that are current > assigned to CXL.mem devices. Before assigning HPAs from a fixed-memory > window, the OSPM must check the current assignments and avoid any > conflicts. > > For any given HPA, it shall not be described by more than one CFMWS entry """ > > Dan, please correct me if I'm wrong, but I'm fairly certain the following is > accurate. > > The CFMWS is the BIOS/EFI's mechanism to report the system configuration to > the Operating System, not the Operating System's mechanism to change > system configurations (such as interleave). What you're talking about is > re-configuring HDM Decoders to interleave devices *presented by* the > CFMWS to the operating system. > > Confusing, I know. But stick with me. > > The interleave referred to the CFMWS is the BIOS/EFI telling the system that > memory accesses to this (physicall address) region will be interleaved across > the set of devices that are backing that region. The operating system is > responsible for reading these settings and presenting the memory to the > system accordingly. > > The BIOS for example could configure all devices behind a single CFMW as a > "Single Device" that interleaves many physical devices, and the OS should > present it as such. In this scenario, there is no need to configure an interleave > region via cxl-cli - the BIOS already did that for you and presented all these > devices as a single device. 
All you need to do is online the memory. > > Configuring the CFMWS *should* (but may not) manifest as a set of BIOS/EFI > options that say how to configure a set of CXL devices behind one or more host > bridges prior to OS boot. This has its limitations. For example, you'd need to > reboot the system to make changes and hotplugging a memory device becomes > impossible. The BIOS/EFI would also need to understand when the prior > configuration is no longer valid - complicated and problematic. > > Additionally, for more dynamic environments (devices behind a switch, or a > DCD) this more "static" configuration may (read: does) reduce your > management flexibility. I.e. hotplug may not be possible. > > > Alternatively, the BIOS may configure each device separately, and the OS is may > create a region that interleaves those devices explicitly by programming an > HDM decoder. > > In this scenario, the OS could tear down the region, hotplug that device, and > recreate the region with new settings accordingly. Greater management > flexibility, but more software/management complexity. > > This requires the OS to recreate the region/interleave set on each reboot - and > is probably the preferred mechanism for configuring the system (if only > because hotplug and device failure is not uncommon). > > In this scenario, re-configuration looks a lot like storage mounting. > The device is either there or it isn't, and the configuration file either works or it > doesn't. Alternatively the daemon setting this all up is free to try to make > auto-configuration decisions. > > (Final note about interleave for completion sake, but not really relevant to this > discussion) > > Alternatively you could just online each device as a separate region, and simply > use something like set_mempolicy/numactl to implement interleave on a > per-task basis. > Thank you for your extremely detailed explanation, it was more helpful than I expected. 
> > > > > But, really is that the above scenario is only for persistent memory with LSA. > > Even if a user configures a new region for volatile memory, and I > > could not find any specification to tell the new configuration to the Firmware. > > > > Could you tell me why such interface is not defined in the CXL specification? > > Is it just because there is no place to store region information for volatile > memory? > > > > > > IMHO, users want to keep previous configuration after reboot even if it is > volatile memory. > > Though users don't concern about contents of volatile memory, they > > want to keep region/interleave configuration after reboot. Especially, > > if previous configuration is some years ago, I'll bet users will forget how they > configured regions against cxl volatile memory. > > > > Probably we want some daemon that reconfigures this similar to how we're > doing it with storage. You register a preferred configuration given the > hardware environment that is valid until the hardware changes. Currently I'm considering how a CXL memory/device pool should be managed for our future product, and I think that such a daemon will ideally need more sophisticated features for the memory/device pool: - (Re)configure regions for volatile memory as you said - Select DCD (if available), or whole memory device hotplug (when cxl2.0) for the memory pool - Select offline memory blocks/devices/DCD area (if a memory block offline fails, the daemon may need to set other memory blocks on the same device online again to return to the previous status and try to hot-remove another device (when cxl2.0?)) - Kick scripts for users' applications (or some other things) to prepare for memory removal (In my experience, a process needed to be sent a signal to hot-remove a device.) - Detect hot-added devices and make new regions (but it may need to wait for the next device if users want to configure interleave with the next device). 
- If necessary, it may need an API to talk to an orchestrator or some other pool management component. - error notification... - etc.... (I suppose there are many things I don't notice yet.) I think such a daemon will be essential for a memory/device pool, but it will require a lot of effort. On the other hand, I feel it may be too sophisticated for a simple server (for example, a server which has only direct-attached CXL memory). In such a case, a simple command like "cxl save region" and "cxl restore region" would be easier for users, I think... > > The OS shouldn't really be telling the firmware to configure itself if only > because what happens if you unplug a device? Anyway, thank you for your detailed explanation. Thanks, --- Yasunori Goto > > ~Gregory ^ permalink raw reply [flat|nested] 12+ messages in thread
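The proposed "cxl save region" / "cxl restore region" pair could behave like the following sketch: snapshot the topology alongside the region config, and refuse to restore (reporting what changed) when the hardware differs, as Goto-san suggests. The JSON shape and field names are invented, not any actual cxl-cli format:

```python
# Sketch of a save/restore pair for region configuration. The saved
# blob records which devices were present so a restore on changed
# hardware fails loudly instead of silently building a different region.
import json

def save(regions, topology):
    return json.dumps({"topology": sorted(topology), "regions": regions})

def restore(blob, current_topology):
    state = json.loads(blob)
    if state["topology"] != sorted(current_topology):
        added = set(current_topology) - set(state["topology"])
        removed = set(state["topology"]) - set(current_topology)
        raise RuntimeError(
            f"hardware changed: added {sorted(added)}, removed {sorted(removed)}")
    return state["regions"]

blob = save([{"region": "region0", "devices": ["mem0", "mem1"]}],
            ["mem0", "mem1"])
# Enumeration order does not matter, only the set of devices.
assert restore(blob, ["mem1", "mem0"])[0]["region"] == "region0"
```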
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price @ 2024-05-31 20:56 ` Dan Williams 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu) 1 sibling, 1 reply; 12+ messages in thread From: Dan Williams @ 2024-05-31 20:56 UTC (permalink / raw) To: Yasunori Gotou (Fujitsu), 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni Yasunori Gotou (Fujitsu) wrote: [..] > > What is currently missing on the Linux OS side is a default policy for unmapped > > volatile capacity after all initial device probing has completed. > > > > > One scenario I understand is that in clusters using an Orchestrator, > > > such as K8S, when a node (worker) restarts, K8S is able to read the > > > Interleave Set from the database and sets it for the corresponding node. > > > > Yes, for sophisticated environments a configuration database could store and > > replay region configurations each boot. > > Just an idea, I suppose that cxl command should have two features. > - save current regions configuration to a file which is specified by its operand. > - reconfigure regions depends on the specified file. > (If hardware condition is changed, the command return error and display what is changed.) > > Probably, it may be enough for the most of users.... But what do you think it? It is not clear that the tool needs a save / restore option vs just replaying the same create-region command from one boot to the next. You can imagine having a startup script that issues one of the following: 1/ cxl create-region 2/ cxl create-region -d decoder0.4 3/ cxl create-region -m mem0 mem1 4/ cxl create-region -S 0x12345 0xabcde 5/ cxl create-region -S 0x12345 0xabcde -s 1T 1/ Find the CXL window with the largest available capacity and create a maximally sized region. 
2/ Limit the search to a specific CXL window, but create the largest available region from any spare capacity found there. 3/ Try to create the largest possible region with mem0 and mem1. NOTE that this could produce wildly different results from boot to boot because CXL device scanning is asynchronous and mem0 from one boot may not match mem0 in the current boot even if the hardware configuration has not changed. 4/ Same as 3/ but guaranteed to get the exact same devices because they are addressed by serial number. 5/ Same as 4/ but limit the size. So, a lot can be done without needing to save/restore the exact configuration; for volatile memory the exact configuration does not matter as much as the capacity and performance characteristics. NOTE, some of the above examples are not implemented yet, patches welcome! For example the -S option to treat the arguments as memory-device serial numbers rather than memory-device id numbers is not there today, "cxl create-region" by itself does not know how to search for capacity, and "cxl create-region -m" without specifying the decoder/window throws an error even though it is relatively straightforward to figure out the set of CXL windows that map a given memdev. ^ permalink raw reply [flat|nested] 12+ messages in thread
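The proposed (not yet implemented) -S option boils down to resolving serial numbers to whatever memdev ids the kernel assigned this boot, so the same physical devices are selected regardless of enumeration order. A sketch, with the serial table standing in for reading each memdev's serial attribute from sysfs; all values are invented:

```python
# Sketch of serial-number-based device selection: map the requested
# serials to this boot's memdev names, failing if a serial is absent.

def resolve_serials(wanted_serials, memdev_serials):
    """memdev_serials: memdev-name -> serial number (int)."""
    by_serial = {serial: name for name, serial in memdev_serials.items()}
    try:
        return [by_serial[s] for s in wanted_serials]
    except KeyError as e:
        raise RuntimeError(f"no memdev with serial {e.args[0]:#x}") from None

# mem numbering flipped between boots; the serials still find the same
# physical devices, addressing Dan's point 3/ vs 4/ above.
assert resolve_serials([0x12345, 0xabcde],
                       {"mem0": 0xabcde, "mem1": 0x12345}) == ["mem1", "mem0"]
```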
* RE: CXL volatile memory: How to restore the previous region/Interleave set 2024-05-31 20:56 ` Dan Williams @ 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu) 0 siblings, 0 replies; 12+ messages in thread From: Yasunori Gotou (Fujitsu) @ 2024-06-03 5:01 UTC (permalink / raw) To: 'Dan Williams', Zhijian Li (Fujitsu), linux-cxl@vger.kernel.org Cc: Jonathan Cameron, dave.jiang@intel.com, Fan Ni > Yasunori Gotou (Fujitsu) wrote: > [..] > > > What is currently missing on the Linux OS side is a default policy > > > for unmapped volatile capacity after all initial device probing has > completed. > > > > > > > One scenario I understand is that in clusters using an > > > > Orchestrator, such as K8S, when a node (worker) restarts, K8S is > > > > able to read the Interleave Set from the database and sets it for the > corresponding node. > > > > > > Yes, for sophisticated environments a configuration database could > > > store and replay region configurations each boot. > > > > Just an idea, I suppose that cxl command should have two features. > > - save current regions configuration to a file which is specified by its > operand. > > - reconfigure regions depends on the specified file. > > (If hardware condition is changed, the command return error and > > display what is changed.) > > > > Probably, it may be enough for the most of users.... But what do you think it? > > It is not clear that the tool needs a save / restore option vs just replaying the > same create-region command from one boot to the next. > > You can imagine having a startup script that issues one of the > following: > > 1/ cxl create-region > 2/ cxl create-region -d decoder0.4 > 3/ cxl create-region -m mem0 mem1 > 4/ cxl create-region -S 0x12345 0xabcde > 4/ cxl create-region -S 0x12345 0xabcde -s 1T > > 1/ Find the CXL window with the largest available capacity and create a > maximally sized region. 
> > 2/ Limit the search to a specific CXL window, but create the largest available > region from any spare capacity found there. > > 3/ Try to create the largest possible region with mem0 and mem1. NOTE that > this could produce wildly different results from boot to boot because CXL > device scanning is asynchronous and mem0 from one boot may not match > mem0 in the current boot even if the hardware configuration has not changed. > > 4/ Same as 3/ but guaranteed to get the exact same devices because they are > addressed by serial number. I feel that -S option may be interesting.... > > 5/ Same as 4/ but limit the size. > > So, a lot can be done without needing to save restore the exact configuration, > which for volatile the exact configuration does not matter as much as the the > capacity and performance characteristics. Thank you for sharing your idea. > > NOTE, some of the above examples are not implemented yet, patches > welcome! For example the -S option to treat the arguments as memory-device > serial numbers rather memory-device id numbers is not there today, "cxl > create-region" by itself does not know how to search for capacity, and "cxl > create-region -m" without specifying the decoder/window throws an error even > though it is relatively straightforward to figure out the set of CXL windows that > map a given memdev. Ok, I see. I'll think about it a bit more. Thanks, ---- Yasunori Goto ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-06-03 5:02 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-24 7:32 CXL volatile memory: How to restore the previous region/Interleave set Zhijian Li (Fujitsu) 2024-05-29 1:08 ` Dan Williams 2024-05-29 10:19 ` Zhijian Li (Fujitsu) 2024-05-29 15:44 ` Gregory Price 2024-05-30 9:56 ` Zhijian Li (Fujitsu) 2024-05-29 11:33 ` Yasunori Gotou (Fujitsu) 2024-05-29 16:40 ` Gregory Price 2024-05-30 10:35 ` Yuquan Wang 2024-05-31 15:50 ` Gregory Price 2024-05-30 10:54 ` Yasunori Gotou (Fujitsu) 2024-05-31 20:56 ` Dan Williams 2024-06-03 5:01 ` Yasunori Gotou (Fujitsu)