* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 0:11 ` Dan Williams
@ 2023-05-23 8:31 ` Yasunori Gotou (Fujitsu)
2023-05-23 17:36 ` Dan Williams
2023-05-23 13:34 ` Vikram Sethi
` (2 subsequent siblings)
3 siblings, 1 reply; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-05-23 8:31 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
Thank you for your answer!
The progress seems better than I expected.
I would like to ask more questions.
> Yasunori Gotou (Fujitsu) wrote:
> > Hello,
> >
> > I have some questions about CXL device hotplug (especially type 3
> > memory devices).
> >
> > Though my team members still need to work on remaining issues of
> > Filesystem-DAX and RDMA for persistent memory, I would like to move
> > some of them to CXL type 3 memory devices after finishing the above
> > work.
> > This is in preparation for that. I would like to confirm the current
> > status of CXL type 3 memory hotplug.
> >
> > Q1) Can the PCIe hotplug driver detect a CXL device and call the CXL driver?
> >
> > The CXL specification says the following:
> > "9.9 Hot-Plug"
> > "CXL leverages PCIe Hot-plug model and Hot-plug elements as defined in
> > PCIe Base Specification and the applicable form-factor specifications."
> >
> > At a glance, the PCIe hotplug driver seems to be able to detect any
> > hotplugged PCIe device and call a suitable probe/attach function. But
> > I'm new around here, and I'm not sure it can actually call the suitable
> > CXL driver.
> >
> > Can the PCIe hotplug driver recognize a CXL device and call its driver?
>
> Yes.
>
> The cxl_pci driver (drivers/cxl/pci.c) is just a typical PCI driver as far as the PCI
> hotplug driver is concerned. So add/remove events of a CXL card get turned
> into probe()/remove() events on the driver.
Sounds good!
> >
> > Q2) Can QEMU/KVM emulate CXL device hotplug?
> >
> > I heard that QEMU/KVM has PCIe device hotplug emulation, but I'm not
> > sure it can hotplug a CXL device.
>
> It can,
Ok, then, are there any test suites for CXL device hotplug with QEMU/KVM emulation?
They will become even more helpful once actual CXL memory devices are released.
> but as far as the driver is concerned you can achieve the same
> by:
>
> echo $devname > /sys/bus/pci/drivers/cxl_pci/unbind
>
> ...that exercises the same software flows as physical unplug.
Ok. I see.
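For my notes, a minimal sketch of exercising that flow from a shell (the
device address 0000:35:00.0 is just a placeholder):

# simulate unplug: detach cxl_pci from the device
echo 0000:35:00.0 > /sys/bus/pci/drivers/cxl_pci/unbind
# simulate re-plug: re-attach the driver
echo 0000:35:00.0 > /sys/bus/pci/drivers/cxl_pci/bind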
> >
> > Q3) After called CXL driver and detected a CXL type 3 memory device,
> > what sequence is/will be executed?
> >
> > IIRC, the kernel creates /sys/devices/system/memory/memoryNNN
> > directories when a memory device is recognized. Then, the online
> > operation is executed by a user or an application.
> >
> > However, the CXL specification seems to require more configuration,
> > like interleave, region, and namespace, by the Fabric Manager (cxl
> > command?) after device detection and before memory online.
> >
> > So, my understanding is that the above configuration must be executed
> > after device detection, and before memory online. Is that correct?
>
> Correct, after the device is added and the driver attaches there is still a step
> needed to configure a CXL region.
>
> For now that step is to manually run:
>
> cxl create-region
>
> ...later we might consider some udev rules to automatically assemble regions
> from discovered capacity.
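For my own understanding, I suppose that manual step looks something like
this for a single device (the decoder and memdev names are just examples):

cxl create-region -m -d decoder0.0 -w 1 mem0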
Hmm, I suppose two types of udev rules may be necessary.
The first one notifies that a new CXL device has been detected, so that the cxl command can
assemble a region automatically.
The second one notifies that a region has been configured, so that each memory block on the
region can be onlined in response, with a rollback if one of the blocks fails to hot-add,
if necessary.
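For example, the first rule might look something like this (the match keys
and the helper script name are my assumptions):

# /etc/udev/rules.d/90-cxl-assemble.rules (sketch)
SUBSYSTEM=="cxl", KERNEL=="mem[0-9]*", ACTION=="add", \
    RUN+="/usr/local/bin/cxl-assemble-region %k"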
>
> > Q4) Do the current CXL drivers/tools support a hot-removal request from PCIe?
> >
> > CXL specification says "In a managed Hot-Remove flow, software is
> > notified of a hot removal request."
>
> Currently there is a requirement that:
>
> cxl disable-memdev
>
> ...is run before the device can be removed. There is no warning from the PCI
> hotplug driver. Which means that if end user does the wrong sequence they
> can crash the kernel / remove memory that may still be in active use.
Ok.
Though "Surprising remove" is not guaranteed by specification, I think
"managed hot-removed flow" should be realized.
I'll chase more what should we do about it.
>
> > I think that the CXL drivers/tools need to find which sections belong
> > to the requested device, and at least offline them. In addition, the
> > Fabric Manager may need to prepare for removing the device due to the
> > configuration change.
> >
> > Can the current CXL drivers/tools execute this?
> > Or does it still need to be implemented?
>
> Currently the 'cxl disable-memdev' command is not smart about determining
> when the device is in active use; it just claims that it is always in use.
> Improvements are in progress.
Ok. I see.
>
> > Q5) How does the CXL driver treat region/namespace size against the section size?
> > The current x86-64 section size can be 2GB, but the CXL region size
> > may be smaller than that.
>
> The section size is still 128MB; the hotplug memory block size is what expands
> to 2GB. That size limits what can be onlined via the dax_kmem driver.
Oops.
OK, I understand I should change my word "section" to "hotplug memory block".
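(As a note, the hotplug memory block size can be confirmed via sysfs; the
value below is an example from a large x86-64 system:)

$ cat /sys/devices/system/memory/block_size_bytes
80000000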
One piece of background for this question is "rollback":
"If memory hot-add or hot-remove for a memory block fails, is rollback available?"
If a block hot-add sequence fails in the device for some reason, its user may want to remove
the device for the moment, and then retry the hot-add or try another device.
To achieve that, the blocks already onlined before the failed block should be offlined again.
If a block hot-remove sequence fails in the device, its user would like to keep the device
online, to postpone replacing it or to select another device for device pooling (and vice versa).
I can't find which component handles this situation.
I noticed that current users prefer to online memory immediately after device detection, and
the kernel supports that. Though it is natural for some use cases, I feel it may be an obstacle
for rollback of a CXL device hotplug failure.
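To illustrate the rollback I have in mind, a rough shell sketch (the block
names are hypothetical):

# online a set of memory blocks; undo already-onlined blocks on failure
onlined=""
for blk in memory512 memory513 memory514; do
    if echo online > /sys/devices/system/memory/$blk/state; then
        onlined="$onlined $blk"
    else
        # rollback: offline the blocks that were already onlined
        for b in $onlined; do
            echo offline > /sys/devices/system/memory/$b/state
        done
        break
    fi
done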
Thanks,
^ permalink raw reply	[flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 8:31 ` Yasunori Gotou (Fujitsu)
@ 2023-05-23 17:36 ` Dan Williams
2023-05-24 11:12 ` Yasunori Gotou (Fujitsu)
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-23 17:36 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'Dan Williams',
linux-cxl@vger.kernel.org
Yasunori Gotou (Fujitsu) wrote:
>
> Thank you for your answer!
> Its progress seems to be better than I thought.
>
> I would like to ask more questions.
>
> > Yasunori Gotou (Fujitsu) wrote:
[..]
> > Correct, after the device is added and the driver attaches there is still a step
> > needed to configure a CXL region.
> >
> > For now that step is to manually run:
> >
> > cxl create-region
> >
> > ...later we might consider some udev rules to automatically assemble regions
> > from discovered capacity.
>
> Hmm, I suppose 2 types of udev rules may be necessary.
> The first one is for notify new CXL device is detected, and cxl-command assemble
> a region automatically.
Yes, I suspect this ends up being similar to the mdadm monitor policy
where the device arrival events trigger notification to a daemon that
can apply an assembly policy.
> The second one is for notify region is configured, online is execute for each
> memory block on the region by the notification, and rollback when one of the block
> fails hotadd If necessary.
This policy needs to coordinate with the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE policy and the memhp_default_state
setting. I.e. the kernel may do this automatically depending on those
settings.
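For example, a sketch of the knobs involved (either should have the same
effect):

# kernel command line
memhp_default_state=online_movable

# or at runtime
echo online_movable > /sys/devices/system/memory/auto_online_blocks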
> > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > >
> > > CXL specification says "In a managed Hot-Remove flow, software is
> > > notified of a hot removal request."
> >
> > Currently there is a requirement that:
> >
> > cxl disable-memdev
> >
> > ...is run before the device can be removed. There is no warning from the PCI
> > hotplug driver. Which means that if end user does the wrong sequence they
> > can crash the kernel / remove memory that may still be in active use.
>
> Ok.
> Though "Surprising remove" is not guaranteed by specification, I think
> "managed hot-removed flow" should be realized.
> I'll chase more what should we do about it.
The nuance here is that even though the PCI hotplug driver supports an
attention button and pauses to let the OS acknowledge the removal, that
acknowledgement is not coordinated with the associated drivers; instead
those drivers just receive a ->remove() notification that cannot be
failed.
So, this means that the CXL device must be shutdown manually with
daxctl offline-memory
cxl disable-region
cxl disable-memdev
...*before* the hotplug attention button is pressed. If any of those
commands fail the device is in active use by the kernel and the hotplug
attempt needs to be cancelled. My expectation is that CXL memory device
removal is not possible in the majority of cases. This is why the
Dynamic Capacity Device definition in CXL 3.0 allows for the flexibility
of partial removal.
> > > I think that CXL drivers/tools need to find which sections belongs to the
> > > requested device, and execute offline them at least. In addition,
> > > Fabric Manager may need to prepare removing the device due to
> > configuration
> > > change.
> > >
> > > Does current CXL drivers/tools can execute them?
> > > Otherwise, does it need to be implemented yet?
> >
> > Currently the 'cxl disable-memdev' command is not smart about determining
> > when the device is in active use it just claims that it is always in use. That is in
> > progress to be improved.
>
> Ok. I see.
>
> >
> > > Q5) How CXL driver treat region/namespace size against section size?
> > > Current x86-64 section size can be 2Gbyte, but CXL region size may be
> > > able to smaller than it.
> >
> > The section size is still 128MB, the hotplug memory block size is what expands
> > to 2GB. That size limits what can be onlined via the dax_kmem driver.
>
> Oops.
> OK, I understand I should change my word "section" to "hotplug memory block".
>
> One of the background of this question is "rollback".
> "If memory hotadd or hotremove for a memory block fails, is rollback available?".
>
> If a block hotadd sequence fails in the device for some reasons, its user wants to remove
> the device for the moment, and may want to retry hotadd again or try other device.
> To achieve it, already onlined blocks before failed block should be offlined again.
>
> If a block hotremove sequence fails in the device, its user would like to keep the device
> online to postpone replacing it or select other device for device pooling. (vice versa).
> I don't find which component handle this situation.
It depends on how the memory is onlined and whether it gets pinned by
the kernel. As long as all of the memory is onlined to ZONE_MOVABLE then
there is a good chance to be able to get it back. However, ZONE_MOVABLE
is not a guarantee that memory can be removed later, and ZONE_MOVABLE
requires some ratio of ZONE_NORMAL memory to be present to make it
usable. See "Zone Imbalances" in
Documentation/admin-guide/mm/memory-hotplug.rst.
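For example, a block can be onlined explicitly to ZONE_MOVABLE rather than
accepting the default zone:

echo online_movable > /sys/devices/system/memory/memoryX/state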
> I noticed that current users prefer online after device detection immediately, and kernel
> supports it. Though it is natural for some use-case, I feel it may be obstacle for rollback of
> CXL device hotplug failure.
Yes, this is a platform owner policy tradeoff decision. Maximize hotplug
capability by limiting how the memory is used, or maximize the
utilization of the memory by limiting hotplug flexibility. The kernel
defaults to maximizing the utilization of the memory, but administrator
policy can go as far as only allowing memory access through the
dedicated device-dax interface.
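For example, with daxctl the same capacity can either stay in device-dax
mode or be handed to the page allocator (device names are illustrative):

# keep the capacity as a dedicated device-dax interface
daxctl reconfigure-device --mode=devdax dax0.0
# ...or online it as hotplugged, movable system-ram
daxctl reconfigure-device --mode=system-ram dax0.0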
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 17:36 ` Dan Williams
@ 2023-05-24 11:12 ` Yasunori Gotou (Fujitsu)
2023-05-24 20:51 ` Dan Williams
2023-05-26 8:05 ` Yasunori Gotou (Fujitsu)
0 siblings, 2 replies; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-05-24 11:12 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> Yasunori Gotou (Fujitsu) wrote:
> >
> > Thank you for your answer!
> > Its progress seems to be better than I thought.
> >
> > I would like to ask more questions.
> >
> > > Yasunori Gotou (Fujitsu) wrote:
> [..]
> > > Correct, after the device is added and the driver attaches there is
> > > still a step needed to configure a CXL region.
> > >
> > > For now that step is to manually run:
> > >
> > > cxl create-region
> > >
> > > ...later we might consider some udev rules to automatically assemble
> > > regions from discovered capacity.
> >
> > Hmm, I suppose 2 types of udev rules may be necessary.
> > The first one is for notify new CXL device is detected, and
> > cxl-command assemble a region automatically.
>
> Yes, I suspect this ends up being similar to the mdadm monitor policy where
> the device arrival events trigger notification to a daemon that can apply an
> assembly policy.
>
> > The second one is for notify region is configured, online is execute
> > for each memory block on the region by the notification, and rollback
> > when one of the block fails hotadd If necessary.
>
> This policy needs to coordinate with the
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE policy and the
> memhp_default_state setting. I.e. the kernel may do this automatically
> depending on those settings.
>
> > > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > > >
> > > > CXL specification says "In a managed Hot-Remove flow, software is
> > > > notified of a hot removal request."
> > >
> > > Currently there is a requirement that:
> > >
> > > cxl disable-memdev
> > >
> > > ...is run before the device can be removed. There is no warning from
> > > the PCI hotplug driver. Which means that if end user does the wrong
> > > sequence they can crash the kernel / remove memory that may still be in
> active use.
> >
> > Ok.
> > Though "Surprising remove" is not guaranteed by specification, I think
> > "managed hot-removed flow" should be realized.
> > I'll chase more what should we do about it.
>
> The nuance here is that even though the PCI hotplug driver supports an
> attention button and pauses to let the OS acknowledge the removal. That
> acknowledgement is not coordinated with the associated drivers instead those
> drivers just receive a ->remove() notification that can not be failed.
>
> So, this means that the CXL device must be shutdown manually with
>
> daxctl offline-memory
> cxl disable-region
> cxl disable-memdev
>
> ...*before* the hotplug attention button is pressed. If any of those commands
> fail the device is in active use by the kernel and the hotplug attempt needs to be
> cancelled. My expectation is that CXL memory device removal is not possible in
> the majority of cases.
> This is why the Dynamic Capacity Device definition in
> CXL 3.0 allows for the flexibility of partial removal.
Hmmm, something bothers me here, but I cannot put it into words yet.
I probably need time to reconsider it. Please wait.
>
> > > > I think that CXL drivers/tools need to find which sections belongs to
> the
> > > > requested device, and execute offline them at least. In addition,
> > > > Fabric Manager may need to prepare removing the device due to
> > > configuration
> > > > change.
> > > >
> > > > Does current CXL drivers/tools can execute them?
> > > > Otherwise, does it need to be implemented yet?
> > >
> > > Currently the 'cxl disable-memdev' command is not smart about
> > > determining when the device is in active use it just claims that it
> > > is always in use. That is in progress to be improved.
> >
> > Ok. I see.
> >
> > >
> > > > Q5) How CXL driver treat region/namespace size against section size?
> > > > Current x86-64 section size can be 2Gbyte, but CXL region size may
> be
> > > > able to smaller than it.
> > >
> > > The section size is still 128MB, the hotplug memory block size is
> > > what expands to 2GB. That size limits what can be onlined via the
> dax_kmem driver.
> >
> > Oops.
> > OK, I understand I should change my word "section" to "hotplug memory
> block".
> >
> > One of the background of this question is "rollback".
> > "If memory hotadd or hotremove for a memory block fails, is rollback
> available?".
> >
> > If a block hotadd sequence fails in the device for some reasons, its
> > user wants to remove the device for the moment, and may want to retry
> hotadd again or try other device.
> > To achieve it, already onlined blocks before failed block should be offlined
> again.
> >
> > If a block hotremove sequence fails in the device, its user would like
> > to keep the device online to postpone replacing it or select other device for
> device pooling. (vice versa).
> > I don't find which component handle this situation.
>
> It depends on how the memory is onlined and whether it gets pinned by the
> kernel. As long as all of the memory is onlined to ZONE_MOVABLE then there is
> a good chance to be able to get it back. However, ZONE_MOVABLE is not a
> guarantee that memory can be removed later, and ZONE_MOVABLE requires
> some ratio of ZONE_NORMAL memory to be present to make it usable. See
> "Zone Imbalances" in Documentation/admin-guide/mm/memory-hotplug.rst.
I know it. I'm probably the first person who proposed that the kernel divide its memory into
movable and non-movable areas. (IIRC, it was a BOF at Ottawa Linux Symposium 2004 or 2005.)
Actually, my name still remains in the git blame of the empty lines of the document.
----
ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst (David Hildenbrand 2021-09-07 19:54:49 -0700 4) Memory Hot(Un)Plug
ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst (David Hildenbrand 2021-09-07 19:54:49 -0700 5) ==================
6867c9310d5da Documentation/memory-hotplug.txt (Yasunori Goto 2007-08-10 13:00:59 -0700 6)
:
---
I'm glad to see that many people have enhanced it since I left memory-hotplug work 😊.
In my understanding, one of the big reasons for memory hotplug failure is long-term pinning
of user pages, as with Infiniband RDMA, and I guess that any similar feature may have the
same problem.
Many CXL devices, like smartNICs, will have such a feature,
because it has ambivalent requirements:
- To achieve fast data transfer, such a feature wants to skip the kernel layer and pin user pages
to transfer data directly. Most CXL devices, like smart NICs, will want to use it.
- On the other hand, the kernel has responsibility for managing such areas. Memory hotplug is
one example of it, and it will be important for CXL memory pools.
I think it is the same as the FS-DAX vs. RDMA issue, and On Demand Paging is the only solution for it.
I expect ODP may be helpful for memory hotplug too.
About the ratio problem between ZONE_NORMAL and ZONE_MOVABLE:
I think users/platforms will configure DDR DRAM as ZONE_NORMAL, and the CXL memory pool
as ZONE_MOVABLE. That is easy for them to understand.
>
> > I noticed that current users prefer online after device detection
> > immediately, and kernel supports it. Though it is natural for some
> > use-case, I feel it may be obstacle for rollback of CXL device hotplug failure.
>
> Yes, this is a platform owner policy tradeoff decision. Maximize hotplug
> capability by limiting how the memory is used, or maximize the utilization of the
> memory by limiting hotplug flexibility. The kernel defaults to maximizing the
> utilization of the memory, but administrator policy can go as far as only allowing
> memory access through the dedicated device-dax interface.
Thanks,
---
Yasunori Goto
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 11:12 ` Yasunori Gotou (Fujitsu)
@ 2023-05-24 20:51 ` Dan Williams
2023-05-25 10:32 ` Yasunori Gotou (Fujitsu)
2023-05-26 8:05 ` Yasunori Gotou (Fujitsu)
1 sibling, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-24 20:51 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'Dan Williams',
linux-cxl@vger.kernel.org
Yasunori Gotou (Fujitsu) wrote:
> > Yasunori Gotou (Fujitsu) wrote:
[..]
> > > If a block hotremove sequence fails in the device, its user would like
> > > to keep the device online to postpone replacing it or select other device for
> > device pooling. (vice versa).
> > > I don't find which component handle this situation.
> >
> > It depends on how the memory is onlined and whether it gets pinned by the
> > kernel. As long as all of the memory is onlined to ZONE_MOVABLE then there is
> > a good chance to be able to get it back. However, ZONE_MOVABLE is not a
> > guarantee that memory can be removed later, and ZONE_MOVABLE requires
> > some ratio of ZONE_NORMAL memory to be present to make it usable. See
> > "Zone Imbalances" in Documentation/admin-guide/mm/memory-hotplug.rst.
>
> I know it. Probably, I'm the first person who proposed that kernel divides its memory into
> movable and not movable area. (IIRC, it was BOF at Ottawa Linux Symposium 2004 or 2005).
> Actually, my name is still remain in git blame in the empty lines of the document.
> ----
> ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst (David Hildenbrand 2021-09-07 19:54:49 -0700 4) Memory Hot(Un)Plug
> ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst (David Hildenbrand 2021-09-07 19:54:49 -0700 5) ==================
> 6867c9310d5da Documentation/memory-hotplug.txt (Yasunori Goto 2007-08-10 13:00:59 -0700 6)
> :
> ---
> I'm glad to see many people have enhanced it after leaving from working for memory-hotplug 😊.
Nice! Yeah, I have noticed that most times when I think I need something
new for memory hotplug and CXL I run into David Hildenbrand's work
associated with virtio-mem.
> In my understanding, one of the big reason of memory hotplug failure is long term pin user pages
> like Infiniband RDMA, and I guess that or any similar features may have same problem.
> Many CXL devices like smartNIC will have such feature.
> Because It has ambivalent requirements.
> - To achieve fast data transfer, such feature want to skip the kernel layer and pin user pages
> to transfer data directly. The most of CXL Device like Smart NIC will want to use it.
> - On the other hand, kernel has responsibility of such area management. Memory hotplug is one example of it,
> and it will be important for CXL memory pool.
>
> I think it is same with the issue FS-DAX vs. RDMA, and On Demand Paging is only one solution for it.
> I expect ODP may helpful for memory hotplug too.
It's going to be interesting. Yes, as memory becomes more dynamic, long-term
page pinning is going to become more and more painful. It's even
worse because it's not just RDMA that causes the problem; it's also any
device assignment to a guest VM that wants to pin all host pages backing
guest memory.
> About ratio problem between ZONE_NORMAL and ZONE_MOVABLE,
> I think user/platform will configure that DDR DRAM will be ZONE_NORMAL, and CXL memory pool will
> be ZONE_MOVABLE. It is easy for them to understand.
While that is easy to understand, I worry that it is in conflict with
one of the main value propositions of CXL, which is vastly expanded
memory capacity. The conflict comes if the capacity of inexpensive CXL
outpaces the ZONE_NORMAL requirements that can be satisfied from
locally attached DDR.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 20:51 ` Dan Williams
@ 2023-05-25 10:32 ` Yasunori Gotou (Fujitsu)
0 siblings, 0 replies; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-05-25 10:32 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> Yasunori Gotou (Fujitsu) wrote:
> > > Yasunori Gotou (Fujitsu) wrote:
> [..]
> > > > If a block hotremove sequence fails in the device, its user would
> > > > like to keep the device online to postpone replacing it or select
> > > > other device for
> > > device pooling. (vice versa).
> > > > I don't find which component handle this situation.
> > >
> > > It depends on how the memory is onlined and whether it gets pinned
> > > by the kernel. As long as all of the memory is onlined to
> > > ZONE_MOVABLE then there is a good chance to be able to get it back.
> > > However, ZONE_MOVABLE is not a guarantee that memory can be removed
> > > later, and ZONE_MOVABLE requires some ratio of ZONE_NORMAL
> memory to
> > > be present to make it usable. See "Zone Imbalances" in
> Documentation/admin-guide/mm/memory-hotplug.rst.
> >
> > I know it. Probably, I'm the first person who proposed that kernel
> > divides its memory into movable and not movable area. (IIRC, it was BOF at
> Ottawa Linux Symposium 2004 or 2005).
> > Actually, my name is still remain in git blame in the empty lines of the
> document.
> > ----
> > ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst
> (David Hildenbrand 2021-09-07 19:54:49 -0700 4) Memory Hot(Un)Plug
> > ac3332c44767b Documentation/admin-guide/mm/memory-hotplug.rst
> (David Hildenbrand 2021-09-07 19:54:49 -0700 5)
> ==================
> > 6867c9310d5da Documentation/memory-hotplug.txt
> (Yasunori Goto 2007-08-10 13:00:59 -0700 6)
> > :
> > ---
> > I'm glad to see many people have enhanced it after leaving from working for
> memory-hotplug 😊.
>
> Nice! Yeah, I have noticed that most times when I think I need something new
> for memory hotplug and CXL I run into David Hildenbrand's work associated
> with virtio-mem.
>
> > In my understanding, one of the big reason of memory hotplug failure
> > is long term pin user pages like Infiniband RDMA, and I guess that or any
> similar features may have same problem.
> > Many CXL devices like smartNIC will have such feature.
> > Because It has ambivalent requirements.
> > - To achieve fast data transfer, such feature want to skip the kernel
> > layer and pin user pages to transfer data directly. The most of CXL Device
> like Smart NIC will want to use it.
> > - On the other hand, kernel has responsibility of such area
> > management. Memory hotplug is one example of it, and it will be important
> for CXL memory pool.
> >
> > I think it is same with the issue FS-DAX vs. RDMA, and On Demand Paging
> is only one solution for it.
> > I expect ODP may helpful for memory hotplug too.
>
> It's going to be interesting. Yes, as memory becomes more dynamic, long term
> page pinning is going to become more and more painful.
Yeah...
I guess that many companies/people don't notice the above ambivalent requirements.
So, I would like to advocate that ODP be supported by many devices once we can confirm that
ODP is also effective for memory hotplug.
- So far, ODP has been supported only by Mellanox (NVIDIA) cards, but we are implementing ODP for SoftRoCE.
https://lore.kernel.org/lkml/2d4f6023-0897-2414-45c0-e16b119dd9fb@gmail.com/T/
So it must be possible at the driver layer.
- The PCIe specification has ATS (Address Translation Service) and PRI (Page Request Interface),
so I think it is OK from the specification point of view.
Then, if many device vendors support it, I expect the problem will decrease....
> It's even worse
> because it's not just RDMA that causes the problem; it's also any device
> assignment to a guest VM that wants to pin all host pages backing guest
> memory.
Ahhhh, certainly.
I'll rethink it too.
>
> > About ratio problem between ZONE_NORMAL and ZONE_MOVABLE, I think
> > user/platform will configure that DDR DRAM will be ZONE_NORMAL, and
> > CXL memory pool will be ZONE_MOVABLE. It is easy for them to understand.
>
> While that is easy to understand, I worry that it is in conflict with one of the
> main value propositions of CXL, which is vastly expanded memory capacity. The
> conflict comes if the capacity of inexpensive CXL outpaces the
> ZONE_NORMAL requirements that can be satisfied from locally attached
> DDR.
I agree. It will depend on the user's use case.
Thanks,
----
Yasunori Goto
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 11:12 ` Yasunori Gotou (Fujitsu)
2023-05-24 20:51 ` Dan Williams
@ 2023-05-26 8:05 ` Yasunori Gotou (Fujitsu)
2023-05-26 14:48 ` Dan Williams
1 sibling, 1 reply; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-05-26 8:05 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> > > > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > > > >
> > > > > CXL specification says "In a managed Hot-Remove flow, software
> is
> > > > > notified of a hot removal request."
> > > >
> > > > Currently there is a requirement that:
> > > >
> > > > cxl disable-memdev
> > > >
> > > > ...is run before the device can be removed. There is no warning
> > > > from the PCI hotplug driver. Which means that if end user does the
> > > > wrong sequence they can crash the kernel / remove memory that may
> > > > still be in
> > active use.
> > >
> > > Ok.
> > > Though "Surprising remove" is not guaranteed by specification, I
> > > think "managed hot-removed flow" should be realized.
> > > I'll chase more what should we do about it.
> >
> > The nuance here is that even though the PCI hotplug driver supports an
> > attention button and pauses to let the OS acknowledge the removal.
> > That acknowledgement is not coordinated with the associated drivers
> > instead those drivers just receive a ->remove() notification that can not be
> failed.
> >
> > So, this means that the CXL device must be shutdown manually with
> >
> > daxctl offline-memory
> > cxl disable-region
> > cxl disable-memdev
> >
> > ...*before* the hotplug attention button is pressed. If any of those
> > commands fail the device is in active use by the kernel and the
> > hotplug attempt needs to be cancelled. My expectation is that CXL
> > memory device removal is not possible in the majority of cases.
> > This is why the Dynamic Capacity Device definition in CXL 3.0 allows
> > for the flexibility of partial removal.
>
> Hmmm, I mind something here, but I cannot make sentence what is it yet
> Probably, I need time to reconsider it. Please wait.
One of the things on my mind here --was-- which documentation describes OS-triggered
hot-remove instead of a PCIe trigger,
because many hardware/firmware developers don't know the circumstances of Linux.
They may want to implement the same system not only for Linux but also for VMware or
any other system, and may want to follow only the specification or similar documents.
But I found "CXL* Type 3 Memory Device Software Guide: 2.13.7 OS managed hot remove sequence":
https://cdrdv2-public.intel.com/643805/643805_CXL%20Memory%20Device%20SW%20Guide_Rev1p0.pdf
Now I can talk with them based on it, so that question is solved.
My remaining questions are the following.
Q6) Is there any way to hot-remove from outside of the servers now?
Currently, an administrator seems to need to log in to a server and execute offline and
cxl disable commands to remove memory in it, right? But in the future, software like a
memory pool manager, Fabric Manager, or any other management tool which manages the CXL
devices of many servers will want to remove each server's devices from the outside.
But I'm not sure whether that is available yet.
Q7) Is there any interface to know which blocks cannot be offlined?
Even if DCD is supported, such manager software seems to need to repeat memory offline
requests for many blocks and collect the succeeded blocks until the requested amount is reached....
That may take a long time to complete even if the success rate is low, and I think a "time out"
for such a case is not a good idea.
If there were an interface for such an application to know which blocks have long-term pinned
pages or other obstacles, the above software could avoid waiting a long time.
If not, I would like to investigate how to make one.
Thanks,
---
Yasunori Goto
^ permalink raw reply	[flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-26 8:05 ` Yasunori Gotou (Fujitsu)
@ 2023-05-26 14:48 ` Dan Williams
2023-05-29 8:07 ` Yasunori Gotou (Fujitsu)
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-26 14:48 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'Dan Williams',
linux-cxl@vger.kernel.org
Yasunori Gotou (Fujitsu) wrote:
[..]
> One of what I mind here --was-- which documentation describes OS triggered hotremove instead of PCIe trigger.
> Because many hardware/firmware developers don't know the circumstance of Linux.
> They may want to implement same system not only for Linux but also for VMware or any other system,
> and may want to obey only the specification or any similar documents.
> But I found " CXL* Type 3 Memory Device Software Guide: 2.13.7 OS managed hot remove sequence"
> https://cdrdv2-public.intel.com/643805/643805_CXL%20Memory%20Device%20SW%20Guide_Rev1p0.pdf
> Then, I can talk with them by it. So, it was solved.
>
> My remain questions are the followings.
>
> Q6) Are there any way to hotremove from outside of servers now?
> Currently, administrator seems to need to login a server and execute offline and cxl disable commands
> to remove memory in it, right? But in future, something software like memory pool manager,
> Fabric Manager, or any other management tools which can manage many servers CXL devices
> will want to remove each server's devices from outside.
> But I'm not sure it can available or not yet now.
As far as I can see all of the PCI hotplug state machines just
coordinate the removal internal to themselves and the PCI bus core
without any participation from the impacted driver before the ->remove()
event. The ->remove() event is too late to cancel the hotplug. So the
change here would be either an upcall to userspace, or some permission
request callback to the impacted driver. Since this is a policy decision
whether to allow a given CXL device to be removed that leans towards a
userspace upcall mechanism.
>
> Q7) Are there any interface to know which block cannot be offlined?
> Even DCD is supported, such manager software seems to need to repeat memory offline
> request for many blocks and collect succeeded blocks until requested amount....
> It may need much time to complete even if its success rate is low. I think that "time out"
> for such case is not so good idea.
> If there is an interface for such application to know which block has long term pin pages
> or any other obstacles, then it is desirable the above software can avoid to wait long time.
>
> If not, I would like to investigate more how to make it.
memory_block_offline() can fail and that error code is returned to the
sysfs operation that wrote 0 to
/sys/devices/system/memory/memoryX/online, but there is no facility to
determine whether a block is long term pinned. You can only try to
offline it, but whether that error is transient or permanent is unknown.
However, one of the mitigations here is that DCD allows for partial
release. So unless the fabric manager strictly needs to get all of its
memory back then it can make forward progress with a partial return
rather than timing out. If the fabric manager needs guarantees then
there is also the option to never online the memory. So, we have some
flexibility here before it is clear that a new interface is needed.
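In the meantime the try-and-collect approach looks something like this
sketch (it keeps offlining blocks until the requested count is reached or
the candidates run out; a failed write may be transient or permanent, and
the kernel does not say which):

needed=16; got=0
for blk in /sys/devices/system/memory/memory*; do
    if echo offline > $blk/state 2>/dev/null; then
        got=$((got + 1))
        [ "$got" -ge "$needed" ] && break
    fi
done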
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-26 14:48 ` Dan Williams
@ 2023-05-29 8:07 ` Yasunori Gotou (Fujitsu)
2023-06-06 17:58 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-05-29 8:07 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> Yasunori Gotou (Fujitsu) wrote:
> [..]
> > One of what I mind here --was-- which documentation describes OS
> triggered hotremove instead of PCIe trigger.
> > Because many hardware/firmware developers don't know the circumstance
> of Linux.
> > They may want to implement same system not only for Linux but also for
> > VMware or any other system, and may want to obey only the specification or
> any similar documents.
> > But I found " CXL* Type 3 Memory Device Software Guide: 2.13.7 OS
> managed hot remove sequence"
> >
> https://cdrdv2-public.intel.com/643805/643805_CXL%20Memory%20Device
> %20
> > SW%20Guide_Rev1p0.pdf Then, I can talk with them by it. So, it was
> > solved.
> >
> > My remain questions are the followings.
> >
> > Q6) Are there any way to hotremove from outside of servers now?
> > Currently, administrator seems to need to login a server and execute
> offline and cxl disable commands
> > to remove memory in it, right? But in future, something software like
> memory pool manager,
> > Fabric Manager, or any other management tools which can manage
> many servers CXL devices
> > will want to remove each server's devices from outside.
> > But I'm not sure it can available or not yet now.
>
> As far as I can see all of the PCI hotplug state machines just coordinate the
> removal internal to themselves and the PCI bus core without any participation
> from the impacted driver before the ->remove() event. The ->remove() event is
> too late to cancel the hotplug. So the change here would be either an upcall to
> userspace, or some permission request callback to the impacted driver. Since
> this is a policy decision whether to allow a given CXL device to be removed that
> leans towards a userspace upcall mechanism.
Ah, sorry... My description of the question was not good.
I understand that PCIe hot-remove is not a suitable trigger for CXL memory.
What I would like to ask is: "Is there any agent or daemon which receives a hot-remove
request from outside the server and executes the offline and cxl disable operations
without user intervention?"
I suppose a memory pool manager (or other software) would like to ask such an agent to
execute that operation.
(Probably, the agent needs to receive the request via a REST API.)
> >
> > Q7) Are there any interface to know which block cannot be offlined?
> > Even DCD is supported, such manager software seems to need to repeat
> memory offline
> > request for many blocks and collect succeeded blocks until requested
> amount....
> > It may need much time to complete even if its success rate is low. I think
> that "time out"
> > for such case is not so good idea.
> > If there is an interface for such application to know which block has long
> term pin pages
> > or any other obstacles, then it is desirable the above software can avoid
> to wait long time.
> >
> > If not, I would like to investigate more how to make it.
>
> memory_block_offline() can fail and that error code is returned to the sysfs
> operation that wrote 0 to /sys/devices/system/memory/memoryX/online, but
> there is no facility to determine whether a block is long term pinned. You can
> only try to offline it, but whether that error is transient or permanent is
> unknown.
>
> However, one of the mitigations here is that DCD allows for partial release. So
> unless the fabric manager strictly needs to get all of its memory back then it can
> make forward progress with a partial return rather than timing out. If the fabric
> manager needs guarantees then there is also the option to never online the
> memory. So, we have some flexibility here before it is clear that a new interface
> is needed.
Ok, I understand.
Thanks,
----
Yasunori Goto
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-29 8:07 ` Yasunori Gotou (Fujitsu)
@ 2023-06-06 17:58 ` Dan Williams
2023-06-08 7:39 ` Yasunori Gotou (Fujitsu)
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-06-06 17:58 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'Dan Williams',
linux-cxl@vger.kernel.org
Yasunori Gotou (Fujitsu) wrote:
> > Yasunori Gotou (Fujitsu) wrote:
> > [..]
> > > One of what I mind here --was-- which documentation describes OS
> > triggered hotremove instead of PCIe trigger.
> > > Because many hardware/firmware developers don't know the circumstance
> > of Linux.
> > > They may want to implement same system not only for Linux but also for
> > > VMware or any other system, and may want to obey only the specification or
> > any similar documents.
> > > But I found " CXL* Type 3 Memory Device Software Guide: 2.13.7 OS
> > managed hot remove sequence"
> > >
> > https://cdrdv2-public.intel.com/643805/643805_CXL%20Memory%20Device
> > %20
> > > SW%20Guide_Rev1p0.pdf Then, I can talk with them by it. So, it was
> > > solved.
> > >
> > > My remain questions are the followings.
> > >
> > > Q6) Are there any way to hotremove from outside of servers now?
> > > Currently, administrator seems to need to login a server and execute
> > offline and cxl disable commands
> > > to remove memory in it, right? But in future, something software like
> > memory pool manager,
> > > Fabric Manager, or any other management tools which can manage
> > many servers CXL devices
> > > will want to remove each server's devices from outside.
> > > But I'm not sure it can available or not yet now.
> >
> > As far as I can see all of the PCI hotplug state machines just coordinate the
> > removal internal to themselves and the PCI bus core without any participation
> > from the impacted driver before the ->remove() event. The ->remove() event is
> > too late to cancel the hotplug. So the change here would be either an upcall to
> > userspace, or some permission request callback to the impacted driver. Since
> > this is a policy decision whether to allow a given CXL device to be removed that
> > leans towards a userspace upcall mechanism.
>
> Ah, sorry... My description of question was not good.
> I understand that PCIe hotremove is not suitable for trigger of CXL memory.
>
> What I would like to ask is "Are there any agent or daemon which gets a hotremove
> request from outside of server and executes offline and cxl disable region without
> users operation?"
> I suppose such memory pool manager (or others) would like to ask the agent to
> execute such operation.
> (Probably, the agent need to get the request by REST API.)
No, there's no coordination between the kernel and userspace when the
attention button is pressed. So any coordinated removal must be handled
before the removal is attempted. I think it would be useful to have a
mode of operation where pressing the attention button just notifies
userspace and it handles the coordinated shutdown of the device.
If the question is having a management API to trigger removal I am not
aware of any work in this space.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-06-06 17:58 ` Dan Williams
@ 2023-06-08 7:39 ` Yasunori Gotou (Fujitsu)
2023-06-08 18:37 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-06-08 7:39 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> > Ah, sorry... My description of question was not good.
> > I understand that PCIe hotremove is not suitable for trigger of CXL memory.
> >
> > What I would like to ask is "Are there any agent or daemon which gets
> > a hotremove request from outside of server and executes offline and
> > cxl disable region without users operation?"
> > I suppose such memory pool manager (or others) would like to ask the
> > agent to execute such operation.
> > (Probably, the agent need to get the request by REST API.)
>
> No, there's no coordination between the kernel and userspace when the
> attention button is pressed. So any coordinated removal must be handled
> before the removal is attempted. I think it would be useful to have a mode of
> operation where pressing the attention button just notifies userspace and it
> handles the coordinated shutdown of the device.
>
> If the question is having a management API to trigger removal I am not aware of
> any work in this space.
Hmmmm, my question is NOT about attention button now.
(I noticed that my quote of previous mail may be not good. Sorry).
I would like to confirm what component/method is still necessary for memory pool.
I imagine that there is(will be) a total memory pool manager which manages
a lot of Linux systems on each servers which have CXL memory.
And I think that an agent/daemon will be necessary in each Linux system to hotplug operation
like offline/online memory blocks, and/or requesting configuration of Dynamic Capacity Device
to a Fabric Manager.
I guess such agent/daemon is not created yet for memory pool feature.
Here is my current understanding of an example of the steps for a memory pool.
This does not include the attention button, and uses DCD.
1) The pool manager requests hot-remove from a daemon in a Linux system by REST API, ssl, or some kind of
network interface.
2) Then the daemon offlines some memory blocks.
3) It collects offlined memory blocks until the requested amount is reached, and requests
configuration of the DCD from the Fabric Manager.
4) The Fabric Manager configures the Dynamic Capacity Device.
5) The memory pool manager detects the completion of that configuration somehow, and requests hot-add to a daemon in
another Linux system.
6) That daemon requests configuration of the DCD from the FM.
7) The daemon detects the new area somehow and onlines blocks for it.
If I still misunderstand something, please let me know.
Thanks,
---
Yasunori Goto
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-06-08 7:39 ` Yasunori Gotou (Fujitsu)
@ 2023-06-08 18:37 ` Dan Williams
2023-06-09 1:02 ` Yasunori Gotou (Fujitsu)
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-06-08 18:37 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'Dan Williams',
linux-cxl@vger.kernel.org
Yasunori Gotou (Fujitsu) wrote:
>
> > > Ah, sorry... My description of question was not good.
> > > I understand that PCIe hotremove is not suitable for trigger of CXL memory.
> > >
> > > What I would like to ask is "Are there any agent or daemon which gets
> > > a hotremove request from outside of server and executes offline and
> > > cxl disable region without users operation?"
> > > I suppose such memory pool manager (or others) would like to ask the
> > > agent to execute such operation.
> > > (Probably, the agent need to get the request by REST API.)
> >
> > No, there's no coordination between the kernel and userspace when the
> > attention button is pressed. So any coordinated removal must be handled
> > before the removal is attempted. I think it would be useful to have a mode of
> > operation where pressing the attention button just notifies userspace and it
> > handles the coordinated shutdown of the device.
> >
> > If the question is having a management API to trigger removal I am not aware of
> > any work in this space.
>
> Hmmmm, my question is NOT about attention button now.
> (I noticed that my quote of previous mail may be not good. Sorry).
> I would like to confirm what component/method is still necessary for memory pool.
>
> I imagine that there is(will be) a total memory pool manager which manages
> a lot of Linux systems on each servers which have CXL memory.
> And I think that an agent/daemon will be necessary in each Linux system to hotplug operation
> like offline/online memory blocks, and/or requesting configuration of Dynamic Capacity Device
> to a Fabric Manager.
> I guess such agent/daemon is not created yet for memory pool feature.
>
> Here is my current understanding of an example of steps for memory pool.
> This does not include attention button, and use DCD.
>
> 1) The pool manager will request hot remove to a daemon in a Linux system by REST API, ssl, or somekind of
> network interface.
> 2) Then the daemon will execute memory offline some memory blocks.
> > 3) It will collect offlined memory blocks until requested amount, and request
> configuration of DCD to Fabric Manager.
> 4) Fabric Manager will configure Dynamic Capacity Device.
> 5) The memory pool manager will detect finish of its configuration somehow, and request hot add to a daemon in
> another Linux system.
> 6) The daemon will request configuration of DCD to FM
> 7) The daemon will detect new area somehow and online blocks for them.
>
> If I still misunderstand something, please let me know.
Yes, I agree that a daemon like that is needed; I am not aware of any
current work in this space.
However, I wonder if there is existing virtio-mem-like management
infrastructure that could be repurposed for coordinating host-mem
dynamic memory adjustment. In other words coordinating a shared pool of
memory across multiple kernel instances is a problem that has been
solved before, CXL just makes it so that the coordination is across
physical hosts rather than virtual.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-06-08 18:37 ` Dan Williams
@ 2023-06-09 1:02 ` Yasunori Gotou (Fujitsu)
0 siblings, 0 replies; 29+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-06-09 1:02 UTC (permalink / raw)
To: 'Dan Williams', linux-cxl@vger.kernel.org
> > Hmmmm, my question is NOT about attention button now.
> > (I noticed that my quote of previous mail may be not good. Sorry).
> > I would like to confirm what component/method is still necessary for memory
> pool.
> >
> > I imagine that there is(will be) a total memory pool manager which
> > manages a lot of Linux systems on each servers which have CXL memory.
> > And I think that an agent/daemon will be necessary in each Linux
> > system to hotplug operation like offline/online memory blocks, and/or
> > requesting configuration of Dynamic Capacity Device to a Fabric Manager.
> > I guess such agent/daemon is not created yet for memory pool feature.
> >
> > Here is my current understanding of an example of steps for memory pool.
> > This does not include attention button, and use DCD.
> >
> > 1) The pool manager will request hot remove to a daemon in a Linux system
> by REST API, ssl, or somekind of
> > network interface.
> > 2) Then the daemon will execute memory offline some memory blocks.
> > 3) It will correct offlined memory blocks until requested amount, and request
> > configuration of DCD to Fabric Manager.
> > 4) Fabric Manager will configure Dynamic Capacity Device.
> > 5) The memory pool manager will detect finish of its configuration somehow,
> and request hot add to a daemon in
> > another Linux system.
> > 6) The daemon will request configuration of DCD to FM
> > 7) The daemon will detect new area somehow and online blocks for them.
> >
> > If I still misunderstand something, please let me know.
>
> Yes, I agree that a daemon like that is needed, I am not aware of any current
> work in this space.
OK!
>
> However, I wonder if there is existing virtio-mem-like management
> infrastructure that could be repurposed for coordinating host-mem dynamic
> memory adjustment. In other words coordinating a shared pool of memory
> across multiple kernel instances is a problem that has been solved before, CXL
> just makes it so that the coordination is across physical hosts rather than
> virtual.
Hmm, I'm not sure. I'll investigate it.
Anyway, thank you very much for your answer!
----
Yasunori Goto
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 0:11 ` Dan Williams
2023-05-23 8:31 ` Yasunori Gotou (Fujitsu)
@ 2023-05-23 13:34 ` Vikram Sethi
2023-05-23 18:40 ` Dan Williams
2024-03-27 7:10 ` Yuquan Wang
2024-03-27 7:18 ` Yuquan Wang
3 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-05-23 13:34 UTC (permalink / raw)
To: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Hi Dan,
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Monday, May 22, 2023 7:12 PM
> To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> cxl@vger.kernel.org
> Cc: 'Dan Williams' <dan.j.williams@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
>
> > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> >
> > CXL specification says "In a managed Hot-Remove flow, software is
> > notified of a hot removal request."
>
> Currently there is a requirement that:
>
> cxl disable-memdev
>
> ...is run before the device can be removed. There is no warning from the PCI
> hotplug driver. Which means that if end user does the wrong sequence they
> can crash the kernel / remove memory that may still be in active use.
>
Is there any notion of a cache flush when memory is removed (or in future CXL reset)?
Generally, CPU caches must be flushed when memory is removed because any evictions
when the memory isn't present can cause async errors which can be fatal to the system
or at least to VMs, depending on ISA. If the kernel does the cache flush, it must be done
with only uncacheable mappings present to prevent speculative fetches after the cache flush.
Even so, kernel VA based cache flushes will likely be slow, so it may be better to have the notion
of an arch callback that can invoke firmware to do the cache flush.
Perhaps arch_remove_memory is the right place to invoke such a cache flush/FW call?
I think the CXL specification should also address the need for cache flush when removing memory
or doing CXL reset.
> > I think that CXL drivers/tools need to find which sections belongs to the
> > requested device, and execute offline them at least. In addition,
> > Fabric Manager may need to prepare removing the device due to
> configuration
> > change.
> >
> > Does current CXL drivers/tools can execute them?
> > Otherwise, does it need to be implemented yet?
>
> Currently the 'cxl disable-memdev' command is not smart about determining
> when the device is in active use it just claims that it is always in use. That is in
> progress to be improved.
>
> > Q5) How CXL driver treat region/namespace size against section size?
> > Current x86-64 section size can be 2Gbyte, but CXL region size may be
> > able to smaller than it.
>
> The section size is still 128MB, the hotplug memory block size is what
> expands to 2GB. That size limits what can be onlined via the dax_kmem
> driver.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 13:34 ` Vikram Sethi
@ 2023-05-23 18:40 ` Dan Williams
2023-05-24 0:02 ` Vikram Sethi
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-23 18:40 UTC (permalink / raw)
To: Vikram Sethi, Dan Williams, Yasunori Gotou (Fujitsu),
linux-cxl@vger.kernel.org, catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Vikram Sethi wrote:
> Hi Dan,
>
> > From: Dan Williams <dan.j.williams@intel.com>
> > Sent: Monday, May 22, 2023 7:12 PM
> > To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> > cxl@vger.kernel.org
> > Cc: 'Dan Williams' <dan.j.williams@intel.com>
> > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> >
> > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > >
> > > CXL specification says "In a managed Hot-Remove flow, software is
> > > notified of a hot removal request."
> >
> > Currently there is a requirement that:
> >
> > cxl disable-memdev
> >
> > ...is run before the device can be removed. There is no warning from the PCI
> > hotplug driver. Which means that if end user does the wrong sequence they
> > can crash the kernel / remove memory that may still be in active use.
> >
> Is there any notion of a cache flush when memory is removed (or in future CXL reset)?
No.
> Generally, CPU caches must be flushed when memory is removed because any evictions
> when the memory isn't present can cause async errors which can be fatal to the system
> or at least to VMs, depending on ISA.
This seems incompatible with memory hotplug. The cache flushing is only
done on the subsequent reuse of physical address range to make sure that
any pending evictions are complete before the newly constituted address
range is put into service, or that any prior clean cache lines of old
content are dropped. See cxl_region_invalidate_memregion() for where
this is called.
> If the kernel does the cache flush, it must be done
> with only uncacheable mappings present to prevent speculative fetches after the cache flush.
This is why the invalidation is done after physical address range is
populated by new devices. To flush any speculative fetches to the old
composition of the address range.
> Even so, kernel VA-based cache flushes will likely be slow, so it may be better to have the notion
> of an arch callback that can invoke firmware to do the cache flush.
> Perhaps arch_remove_memory is the right place to invoke such a cache flush/FW call?
> I think the CXL specification should also address the need for a cache flush when removing memory
> or doing CXL reset.
Seems out of scope for the CXL specification; this is up to each arch to
handle.
Here is some discussion of what ARM is thinking about in this space:
https://lore.kernel.org/all/40cd479b-f0f8-5dba-0e41-4cef73693927@arm.com/
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-23 18:40 ` Dan Williams
@ 2023-05-24 0:02 ` Vikram Sethi
2023-05-24 4:03 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-05-24 0:02 UTC (permalink / raw)
To: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, May 23, 2023 1:40 PM
> To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> <james.morse@arm.com>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
>
> Vikram Sethi wrote:
> > Hi Dan,
> >
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Sent: Monday, May 22, 2023 7:12 PM
> > > To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> > > cxl@vger.kernel.org
> > > Cc: 'Dan Williams' <dan.j.williams@intel.com>
> > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > >
> > > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > > >
> > > > CXL specification says "In a managed Hot-Remove flow, software is
> > > > notified of a hot removal request."
> > >
> > > Currently there is a requirement that:
> > >
> > > cxl disable-memdev
> > >
> > > ...is run before the device can be removed. There is no warning from
> > > the PCI hotplug driver. Which means that if end user does the wrong
> > > sequence they can crash the kernel / remove memory that may still be in
> active use.
> > >
> > Is there any notion of a cache flush when memory is removed (or in future
> CXL reset)?
>
> No.
>
> > Generally, CPU caches must be flushed when memory is removed because
> > any evictions when the memory isn't present can cause async errors
> > which can be fatal to the system or at least to VMs, depending on ISA.
>
> This seems incompatible with memory hotplug. The cache flushing is only
> done on the subsequent reuse of physical address range to make sure that
> any pending evictions are complete before the newly constituted address
> range is put into service, or that any prior clean cache lines of old content are
> dropped. See cxl_region_invalidate_memregion() for where this is called.
>
> > If the kernel does the cache flush, it must be done with only
> > uncacheable mappings present to prevent speculative fetches after the
> cache flush.
>
> This is why the invalidation is done after physical address range is populated
> by new devices. To flush any speculative fetches to the old composition of
> the address range.
>
I don't think invalidation on the probe path will always work for devices with snoop filters, including HDM-DB memory devices or CXL type 2 accelerators.
After CXL reset or hot-plug insertion, an HDM-DB device's snoop filter isn't tracking any lines checked out by the host. Even if those were just clean lines in CPU caches, hosts can send drop notifications in CXL in response to the cache flush (MemClnEvct).
A device that isn't expecting this evict notification can go into an error state and optionally raise a device internal error interrupt. So you could end up with a non-functional device.
> > Even so, kernel VA-based cache flushes will likely be slow, so it may be
> > better to have the notion of an arch callback that can invoke firmware to
> > do the cache flush.
> > Perhaps arch_remove_memory is the right place to invoke such a cache
> > flush/FW call?
> > I think the CXL specification should also address the need for a cache
> > flush when removing memory or doing CXL reset.
>
> Seems out of scope for the CXL specification, this is up to each arch to
> handle.
>
> Here is some discussion of what ARM is thinking about in this space:
>
> https://lore.kernel.org/all/40cd479b-f0f8-5dba-0e41-
> 4cef73693927@arm.com/
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 0:02 ` Vikram Sethi
@ 2023-05-24 4:03 ` Dan Williams
2023-05-24 14:47 ` Vikram Sethi
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-24 4:03 UTC (permalink / raw)
To: Vikram Sethi, Dan Williams, Yasunori Gotou (Fujitsu),
linux-cxl@vger.kernel.org, catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Vikram Sethi wrote:
> > From: Dan Williams <dan.j.williams@intel.com>
> > Sent: Tuesday, May 23, 2023 1:40 PM
> > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> > linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> > <james.morse@arm.com>
> > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> >
> > Vikram Sethi wrote:
> > > Hi Dan,
> > >
> > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > Sent: Monday, May 22, 2023 7:12 PM
> > > > To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> > > > cxl@vger.kernel.org
> > > > Cc: 'Dan Williams' <dan.j.williams@intel.com>
> > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > >
> > > > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > > > >
> > > > > CXL specification says "In a managed Hot-Remove flow, software is
> > > > > notified of a hot removal request."
> > > >
> > > > Currently there is a requirement that:
> > > >
> > > > cxl disable-memdev
> > > >
> > > > ...is run before the device can be removed. There is no warning from
> > > > the PCI hotplug driver. Which means that if end user does the wrong
> > > > sequence they can crash the kernel / remove memory that may still be in
> > active use.
> > > >
> > > Is there any notion of a cache flush when memory is removed (or in future
> > CXL reset)?
> >
> > No.
> >
> > > Generally, CPU caches must be flushed when memory is removed because
> > > any evictions when the memory isn't present can cause async errors
> > > which can be fatal to the system or at least to VMs, depending on ISA.
> >
> > This seems incompatible with memory hotplug. The cache flushing is only
> > done on the subsequent reuse of physical address range to make sure that
> > any pending evictions are complete before the newly constituted address
> > range is put into service, or that any prior clean cache lines of old content are
> > dropped. See cxl_region_invalidate_memregion() for where this is called.
> >
> > > If the kernel does the cache flush, it must be done with only
> > > uncacheable mappings present to prevent speculative fetches after the
> > cache flush.
> >
> > This is why the invalidation is done after physical address range is populated
> > by new devices. To flush any speculative fetches to the old composition of
> > the address range.
> >
> I don't think invalidate on the probe path will always work for
> devices with snoop filters, including HDM-DB memory devices or CXL
> type2 accelerators. After CXL reset or hot plug insertion, a HDM-DB
> device's snoop filter isn't tracking any lines checked out by the
> host. Even if those were just clean lines in CPU caches, hosts can
> send drop notifications in CXL in response to the cache flush
> (MemClnEvict). A device that isn't expecting this evict notification
> can go into error state and optionally raise a device internal error
> interrupt. So you could end up with a non functional device.
I don't understand this failure mode. Accelerator is added, driver sets
up an HDM decode range and triggers CPU cache invalidation before
mapping the memory into page tables. Wouldn't the device, upon receiving
an invalidation request, just snoop its caches and say "nothing for me
to do"?
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 4:03 ` Dan Williams
@ 2023-05-24 14:47 ` Vikram Sethi
2023-05-24 21:20 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-05-24 14:47 UTC (permalink / raw)
To: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, May 23, 2023 11:03 PM
> To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> <james.morse@arm.com>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
> Vikram Sethi wrote:
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Sent: Tuesday, May 23, 2023 1:40 PM
> > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > >
> > > Vikram Sethi wrote:
> > > > Hi Dan,
> > > >
> > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > Sent: Monday, May 22, 2023 7:12 PM
> > > > > To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> > > > > cxl@vger.kernel.org
> > > > > Cc: 'Dan Williams' <dan.j.williams@intel.com>
> > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > >
> > > > > > Q4) Current CXL drivers/tools support Hot-removal request from
> PCIe?
> > > > > >
> > > > > > CXL specification says "In a managed Hot-Remove flow, software
> is
> > > > > > notified of a hot removal request."
> > > > >
> > > > > Currently there is a requirement that:
> > > > >
> > > > > cxl disable-memdev
> > > > >
> > > > > ...is run before the device can be removed. There is no warning
> > > > > from the PCI hotplug driver. Which means that if end user does
> > > > > the wrong sequence they can crash the kernel / remove memory
> > > > > that may still be in
> > > active use.
> > > > >
> > > > Is there any notion of a cache flush when memory is removed (or in
> > > > future
> > > CXL reset)?
> > >
> > > No.
> > >
> > > > Generally, CPU caches must be flushed when memory is removed
> > > > because any evictions when the memory isn't present can cause
> > > > async errors which can be fatal to the system or at least to VMs,
> depending on ISA.
> > >
> > > This seems incompatible with memory hotplug. The cache flushing is
> > > only done on the subsequent reuse of physical address range to make
> > > sure that any pending evictions are complete before the newly
> > > constituted address range is put into service, or that any prior
> > > clean cache lines of old content are dropped. See
> cxl_region_invalidate_memregion() for where this is called.
> > >
> > > > If the kernel does the cache flush, it must be done with only
> > > > uncacheable mappings present to prevent speculative fetches after
> > > > the
> > > cache flush.
> > >
> > > This is why the invalidation is done after physical address range is
> > > populated by new devices. To flush any speculative fetches to the
> > > old composition of the address range.
> > >
> > I don't think invalidate on the probe path will always work for
> > devices with snoop filters, including HDM-DB memory devices or CXL
> > type2 accelerators. After CXL reset or hot plug insertion, a HDM-DB
> > device's snoop filter isn't tracking any lines checked out by the
> > host. Even if those were just clean lines in CPU caches, hosts can
> > send drop notifications in CXL in response to the cache flush
> > (MemClnEvict). A device that isn't expecting this evict notification
> > can go into error state and optionally raise a device internal error
> > interrupt. So you could end up with a non functional device.
>
> I don't understand this failure mode. Accelerator is added, driver sets up an
> HDM decode range and triggers CPU cache invalidation before mapping the
> memory into page tables. Wouldn't the device, upon receiving an invalidation
> request, just snoop its caches and say "nothing for me to do"?
The device's snoop filter is in a clean reset/power-on state. It is not tracking anything checked out by the host CPU/peer.
If it starts receiving writebacks or even CleanEvicts for its memory, it certainly looks like an unexpected coherency message, and I
know of at least one implementation that triggers an error interrupt in response. I don't know of a statement
in the specification that this is expected and implementations should ignore it. If there is such a statement, could you
please point me to it?
Removing memory needs a cache flush IMO, in a way that prevents speculative fetches.
This can be done in the kernel with uncacheable mappings alone, if possible in the arch callback, or via a FW call.
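To illustrate the shape of what I mean, a rough sketch (the hook name is
made up, not an existing kernel interface):

#include <linux/types.h>
#include <asm/smp.h>

/* Hypothetical arch hook: flush CPU caches for a physical range that is
 * about to be removed. Must run after all cacheable mappings of the
 * range are gone, so nothing is speculatively refetched behind the
 * flush. */
void arch_flush_removed_memory(phys_addr_t start, resource_size_t size)
{
#ifdef CONFIG_X86
	/* coarse but safe: write back and invalidate all caches */
	wbinvd_on_all_cpus();
#else
	/* other ISAs might flush by VA over an uncacheable alias of the
	 * range, or delegate to firmware */
#endif
}

wbinvd_on_all_cpus() is just the bluntest possible example; a range-based
flush or a FW call would be preferable.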
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 14:47 ` Vikram Sethi
@ 2023-05-24 21:20 ` Dan Williams
2023-05-31 4:25 ` Vikram Sethi
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-05-24 21:20 UTC (permalink / raw)
To: Vikram Sethi, Dan Williams, Yasunori Gotou (Fujitsu),
linux-cxl@vger.kernel.org, catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Vikram Sethi wrote:
[..]
> > I don't understand this failure mode. Accelerator is added, driver sets up an
> > HDM decode range and triggers CPU cache invalidation before mapping the
> > memory into page tables. Wouldn't the device, upon receiving an invalidation
> > request, just snoop its caches and say "nothing for me to do"?
>
> Device's snoop filter is in a clean reset/power on state. It is not
> tracking anything checked out by the host CPU/peer. If it starts
> receiving writebacks or even CleanEvicts for its memory,
CleanEvict is a device-to-host request. We are talking about
host-to-device requests, which are only SnpData, SnpInv, and SnpCur,
right?
> looks like an unexpected coherency message, and I know of at least one
> implementation that triggers an error interrupt in response. I don't
> know of a statement in the specification that this is expected and
> implementations should ignore it. If there is such a statement, could you
> please point me to it?
All the specification says (CXL 3.0 3.2.4.4 Host to Device Requests) is
what to do *if* the device is holding that cacheline.
If a device fails when it gets one of those requests when it does not
hold a line then how can this work in the nominal case of the device not
owning any random cacheline?
> Remove memory needs a cache flush IMO, in a way that prevents
> speculative fetches. This can be done in kernel with uncacheable
> mappings alone, if possible in the arch callback, or via FW call.
That assumes that the kernel owns all mappings. I worry about mappings
that the kernel cannot see like x86 SMM. That's why it's currently an
invalidate before next usage, but I am not opposed to also flushing on
remove if the current solution is causing device-failures in practice.
Can you confirm that the current kernel arrangement is causing failures
in practice, or is this a theoretical concern? ...and if it is happening
in practice do you have the example patch that fixes it?
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-24 21:20 ` Dan Williams
@ 2023-05-31 4:25 ` Vikram Sethi
2023-06-06 20:54 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-05-31 4:25 UTC (permalink / raw)
To: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Hi Dan,
Apologies for the delayed response, was out for a few days.
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Wednesday, May 24, 2023 4:20 PM
> To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> <james.morse@arm.com>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
> Vikram Sethi wrote:
> [..]
> > > I don't understand this failure mode. Accelerator is added, driver
> > > sets up an HDM decode range and triggers CPU cache invalidation
> > > before mapping the memory into page tables. Wouldn't the device,
> > > upon receiving an invalidation request, just snoop its caches and say
> "nothing for me to do"?
> >
> > Device's snoop filter is in a clean reset/power on state. It is not
> > tracking anything checked out by the host CPU/peer. If it starts
> > receiving writebacks or even CleanEvicts for its memory,
>
> CleanEvict is a device-to-host request. We are talking about host-to-device
> requests which is only SnpData, SnpInv, and SnpCur, right?
>
I was referring to MemClnEvct, which is a host-to-device request (M2S Req), as captured in Table C-3 of the latest specification.
> > looks like an unexpected coherency message, and I know of at least one
> > implementation that triggers an error interrupt in response. I don't
> > know of a statement in the specification that this is expected and
> > implementations should ignore it. If there is such a statement, could you
> > please point me to it?
>
> All the specification says (CXL 3.0 3.2.4.4 Host to Device Requests) is what to
> do *if* the device is holding that cacheline.
>
> If a device fails when it gets one of those requests when it does not hold a
> line then how can this work in the nominal case of the device not owning any
> random cacheline?
I didn't understand. The line in question is owned by the device (it is device memory). The device has just been CXL reset or powered up and its snoop filter isn't tracking ANY of its lines as checked out by the host. The host tells the device it is dropping a line that the host had checked out (MemClnEvct) but per the device the host never checked anything out. Seems perfectly reasonable for the device to think it is an incorrect coherency message and flag an error. What is the nominal case that you think is broken?
>
> > Remove memory needs a cache flush IMO, in a way that prevents
> > speculative fetches. This can be done in kernel with uncacheable
> > mappings alone, if possible in the arch callback, or via FW call.
>
> That assumes that the kernel owns all mappings. I worry about mappings that
> the kernel cannot see like x86 SMM. That's why it's currently an invalidate
> before next usage, but I am not opposed to also flushing on remove if the
> current solution is causing device-failures in practice.
>
> Can you confirm that the current kernel arrangement is causing failures in
> practice, or is this a theoretical concern? ...and if it is happening in practice do
> you have the example patch that fixes it?
Yes, it is causing error interrupts from the device around device reset if the host caches are not flushed before the reset.
It is currently being worked around via ACPI magic that does the cache flush and then the reset, but kernel-aware handling of the flush seems more appropriate for both hotplug and CXL reset (whether via a direct flush or via FW calls from arch callbacks).
* RE: Questions about CXL device (type 3 memory) hotplug
2023-05-31 4:25 ` Vikram Sethi
@ 2023-06-06 20:54 ` Dan Williams
2023-06-07 1:06 ` Vikram Sethi
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-06-06 20:54 UTC (permalink / raw)
To: Vikram Sethi, Dan Williams, Yasunori Gotou (Fujitsu),
linux-cxl@vger.kernel.org, catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
Vikram Sethi wrote:
> Hi Dan,
> Apologies for the delayed response, was out for a few days.
>
> > From: Dan Williams <dan.j.williams@intel.com>
> > Sent: Wednesday, May 24, 2023 4:20 PM
> > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> > linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> > <james.morse@arm.com>
> > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > Vikram Sethi wrote:
> > [..]
> > > > I don't understand this failure mode. Accelerator is added, driver
> > > > sets up an HDM decode range and triggers CPU cache invalidation
> > > > before mapping the memory into page tables. Wouldn't the device,
> > > > upon receiving an invalidation request, just snoop its caches and say
> > "nothing for me to do"?
> > >
> > > Device's snoop filter is in a clean reset/power on state. It is not
> > > tracking anything checked out by the host CPU/peer. If it starts
> > > receiving writebacks or even CleanEvicts for its memory,
> >
> > CleanEvict is a device-to-host request. We are talking about host-to-device
> > requests which is only SnpData, SnpInv, and SnpCur, right?
> >
> I was referring to MemClnEvct which is a Host request to device (M2S
> req) as captured in table C-3 of the latest specification
Ok, thanks for that clarification.
>
> > > looks like an unexpected coherency message, and I know of at least one
> > > implementation that triggers an error interrupt in response. I don't
> > > know of a statement in the specification that this is expected and
> > > implementations should ignore it. If there is such a statement, could you
> > > please point me to it?
> >
> > All the specification says (CXL 3.0 3.2.4.4 Host to Device Requests) is what to
> > do *if* the device is holding that cacheline.
> >
> > If a device fails when it gets one of those requests when it does not hold a
> > line then how can this work in the nominal case of the device not owning any
> > random cacheline?
>
> I didn't understand. The line in question is owned by the device (it
> is device memory). The device has just been CXL reset or powered up
> and its snoop filter isn't tracking ANY of its lines as checked out by
> the host. The host tells the device it is dropping a line that the
> host had checked out (MemClnEvct) but per the device the host never
> checked anything out. Seems perfectly reasonable for the device to
> think it is an incorrect coherency message and flag an error. What is
> the nominal case that you think is broken?
The case I was considering was a broadcast / anonymous invalidation
event, but now I see that MemClnEvct implies that the line was
previously in the Shared / Exclusive state, so I see your point. The
host will not send MemClnEvct in the scenario I was envisioning.
> >
> > > Remove memory needs a cache flush IMO, in a way that prevents
> > > speculative fetches. This can be done in kernel with uncacheable
> > > mappings alone, if possible in the arch callback, or via FW call.
> >
> > That assumes that the kernel owns all mappings. I worry about mappings that
> > the kernel cannot see like x86 SMM. That's why it's currently an invalidate
> > before next usage, but I am not opposed to also flushing on remove if the
> > current solution is causing device-failures in practice.
> >
> > Can you confirm that the current kernel arrangement is causing failures in
> > practice, or is this a theoretical concern? ...and if it is happening in practice do
> > you have the example patch that fixes it?
> Yes, it is causing error interrupts from the device around device
> reset if the host caches are not flushed before the reset. It is
> currently being worked around via ACPI magic for the cache flush then
> reset, but kernel aware handling of the flush seems more appropriate
> for both hot plug and CXL reset (whether via direct flush or via FW
> calls from arch callbacks).
Makes sense, and yikes, "ACPI magic". My concern, though, as you note above,
is the cache line immediately going back to the "Shared" state from
speculation before the HDM decoder space is shut down. It seems it would
only be safe to invalidate sometime *after* all of the page tables and
HDM decode have been torn down, and to suppress any errors that result from
unaccepted writes.
I.e. would something like this solve the immediate problem? Or does the
architecture need to have the address range mapped into tables and
decode operational for the flush to succeed?
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 543c4499379e..60d1b5ecf936 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
 	struct cxl_region_params *p = &cxlr->params;
 	int i, rc = 0;
 
+	/*
+	 * Before the new region goes active, and while the physical address
+	 * range is not mapped in any page tables, invalidate any previously
+	 * cached lines in this physical address range.
+	 */
+	rc = cxl_region_invalidate_memregion(cxlr);
+	if (rc)
+		return rc;
+
 	for (i = 0; i < p->nr_targets; i++) {
 		struct cxl_endpoint_decoder *cxled = p->targets[i];
 		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
@@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
 		goto out;
 	}
 
-	rc = cxl_region_invalidate_memregion(cxlr);
-
 	/*
 	 * From this point on any path that changes the region's state away from
 	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
* RE: Questions about CXL device (type 3 memory) hotplug
2023-06-06 20:54 ` Dan Williams
@ 2023-06-07 1:06 ` Vikram Sethi
2023-06-07 15:12 ` Jonathan Cameron
0 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-06-07 1:06 UTC (permalink / raw)
To: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse
Cc: Natu, Mahesh
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, June 6, 2023 3:55 PM
> To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> <james.morse@arm.com>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
> Vikram Sethi wrote:
> > Hi Dan,
> > Apologies for the delayed response, was out for a few days.
> >
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > Vikram Sethi wrote:
> > > [..]
> > > > > I don't understand this failure mode. Accelerator is added,
> > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > invalidation before mapping the memory into page tables.
> > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > just snoop its caches and say
> > > "nothing for me to do"?
> > > >
> > > > Device's snoop filter is in a clean reset/power on state. It is
> > > > not tracking anything checked out by the host CPU/peer. If it
> > > > starts receiving writebacks or even CleanEvicts for its memory,
> > >
> > > CleanEvict is a device-to-host request. We are talking about
> > > host-to-device requests which is only SnpData, SnpInv, and SnpCur,
> right?
> > >
> > I was referring to MemClnEvct which is a Host request to device (M2S
> > req) as captured in table C-3 of the latest specification
>
> Ok, thanks for that clarification.
>
> >
> > > > looks like an unexpected coherency message and i Know of at least
> > > > one implementation that triggers an error interrupt in response. I
> > > > don't know of a statement In the specification that this is
> > > > expected and implementations should ignore. If there is such a
> > > > statement, could you please point me to it?
> > >
> > > All the specification says (CXL 3.0 3.2.4.4 Host to Device Requests)
> > > is what to do *if* the device is holding that cacheline.
> > >
> > > If a device fails when it gets one of those requests when it does
> > > not hold a line then how can this work in the nominal case of the
> > > device not owning any random cacheline?
> >
> > I didn't understand. The line in question is owned by the device (it
> > is device memory). The device has just been CXL reset or powered up
> > and its snoop filter isn't tracking ANY of its lines as checked out by
> > the host. The host tells the device it is dropping a line that the
> > host had checked out (MemClnEvct) but per the device the host never
> > checked anything out. Seems perfectly reasonable for the device to
> > think it is an incorrect coherency message and flag an error. What is
> > the nominal case that you think is broken?
>
> The case I was considering was a broadcast / anonymous invalidation event,
> but now I see that MemClnEvct implies that the line was previously in the
> Shared / Exclusive state, so now I see your point. The host will not send
> MemClnEvct in the scenario I was envisioning.
>
> > >
> > > > Remove memory needs a cache flush IMO, in a way that prevents
> > > > speculative fetches. This can be done in kernel with uncacheable
> > > > mappings alone, if possible in the arch callback, or via FW call.
> > >
> > > That assumes that the kernel owns all mappings. I worry about
> > > mappings that the kernel cannot see like x86 SMM. That's why it's
> > > currently an invalidate before next usage, but I am not opposed to
> > > also flushing on remove if the current solution is causing device-failures in
> practice.
> > >
> > > Can you confirm that the current kernel arrangement is causing
> > > failures in practice, or is this a theoretical concern? ...and if it
> > > is happening in practice do you have the example patch that fixes it?
> > Yes, it is causing error interrupts from the device around device
> > reset if the host caches are not flushed before the reset. It is
> > currently being worked around via ACPI magic for the cache flush then
> > reset, but kernel aware handling of the flush seems more appropriate
> > for both hot plug and CXL reset (whether via direct flush or via FW
> > calls from arch callbacks).
>
> Makes sense, and yikes "ACPI magic". My concern though as you note above
> is the cache line immediately going back to the "Shared" state from
> speculation before the HDM decoder space is shutdown. It seems it would
> only be safe to invalidate sometime *after* all of the page tables and HDM
> decode has been torn down, and suppress any errors that result from
> unaccepted writes.
I agree regarding flushing the cache after the page table mappings are removed, but I'm not sure that HDM decode teardown is a requirement to prevent speculation.
Are there architectures that can speculate to an arbitrary PA without any PTE mappings to that PA?
Would cxl_region_decode_reset be guaranteed to not have any page table mappings to the region, and be a suitable place to also flush for a CXL-reset type scenario?
>
> I.e. would something like this solve the immediate problem? Or does the
> architecture need to have the address range mapped into tables and decode
> operational for the flush to succeed?
The specific implementation does not require page table mappings to flush caches. I'm not sure that simply suppressing error interrupts for any writebacks or MemClnEvct that happen after a device insertion/reset is good enough, as devices could view that as a coherency error.
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..60d1b5ecf936 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
>  	struct cxl_region_params *p = &cxlr->params;
>  	int i, rc = 0;
> 
> +	/*
> +	 * Before the new region goes active, and while the physical address
> +	 * range is not mapped in any page tables, invalidate any previously
> +	 * cached lines in this physical address range.
> +	 */
> +	rc = cxl_region_invalidate_memregion(cxlr);
> +	if (rc)
> +		return rc;
> +
>  	for (i = 0; i < p->nr_targets; i++) {
>  		struct cxl_endpoint_decoder *cxled = p->targets[i];
>  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
>  		goto out;
>  	}
> 
> -	rc = cxl_region_invalidate_memregion(cxlr);
> -
>  	/*
>  	 * From this point on any path that changes the region's state away from
>  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>
* Re: Questions about CXL device (type 3 memory) hotplug
2023-06-07 1:06 ` Vikram Sethi
@ 2023-06-07 15:12 ` Jonathan Cameron
2023-06-07 18:44 ` Vikram Sethi
0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2023-06-07 15:12 UTC (permalink / raw)
To: Vikram Sethi
Cc: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse, Natu, Mahesh
On Wed, 7 Jun 2023 01:06:05 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:
> > From: Dan Williams <dan.j.williams@intel.com>
> > Sent: Tuesday, June 6, 2023 3:55 PM
> > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > Vikram Sethi wrote:
> > > Hi Dan,
> > > Apologies for the delayed response, was out for a few days.
> > >
> > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > Vikram Sethi wrote:
> > > > [..]
> > > > > > I don't understand this failure mode. Accelerator is added,
> > > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > > invalidation before mapping the memory into page tables.
> > > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > > just snoop its caches and say
> > > > "nothing for me to do"?
> > > > >
> > > > > Device's snoop filter is in a clean reset/power on state. It
> > > > > is not tracking anything checked out by the host CPU/peer.
> > > > > If it starts receiving writebacks or even CleanEvicts for its
> > > > > memory,
> > > >
> > > > CleanEvict is a device-to-host request. We are talking about
> > > > host-to-device requests which is only SnpData, SnpInv, and
> > > > SnpCur,
> > right?
> > > >
> > > I was referring to MemClnEvct which is a Host request to device
> > > (M2S req) as captured in table C-3 of the latest specification
> >
> > Ok, thanks for that clarification.
> >
> > >
> > > > > looks like an unexpected coherency message, and I know of at
> > > > > least one implementation that triggers an error interrupt in
> > > > > response. I don't know of a statement in the specification
> > > > > that this is expected and implementations should ignore it. If
> > > > > there is such a statement, could you please point me to it?
> > > >
> > > > All the specification says (CXL 3.0 3.2.4.4 Host to Device
> > > > Requests) is what to do *if* the device is holding that
> > > > cacheline.
> > > >
> > > > If a device fails when it gets one of those requests when it
> > > > does not hold a line then how can this work in the nominal case
> > > > of the device not owning any random cacheline?
> > >
> > > I didn't understand. The line in question is owned by the device
> > > (it is device memory). The device has just been CXL reset or
> > > powered up and its snoop filter isn't tracking ANY of its lines
> > > as checked out by the host. The host tells the device it is
> > > dropping a line that the host had checked out (MemClnEvct) but
> > > per the device the host never checked anything out. Seems
> > > perfectly reasonable for the device to think it is an incorrect
> > > coherency message and flag an error. What is the nominal case
> > > that you think is broken?
> >
> > The case I was considering was a broadcast / anonymous invalidation
> > event, but now I see that MemClnEvct implies that the line was
> > previously in the Shared / Exclusive state, so now I see your
> > point. The host will not send MemClnEvct in the scenario I was
> > envisioning.
> > > >
> > > > > Remove memory needs a cache flush IMO, in a way that prevents
> > > > > speculative fetches. This can be done in kernel with
> > > > > uncacheable mappings alone, if possible in the arch callback,
> > > > > or via FW call.
> > > >
> > > > That assumes that the kernel owns all mappings. I worry about
> > > > mappings that the kernel cannot see like x86 SMM. That's why
> > > > it's currently an invalidate before next usage, but I am not
> > > > opposed to also flushing on remove if the current solution is
> > > > causing device-failures in
> > practice.
> > > >
> > > > Can you confirm that the current kernel arrangement is causing
> > > > failures in practice, or is this a theoretical concern? ...and
> > > > if it is happening in practice do you have the example patch
> > > > that fixes it?
> > > Yes, it is causing error interrupts from the device around device
> > > reset if the host caches are not flushed before the reset. It is
> > > currently being worked around via ACPI magic for the cache flush
> > > then reset, but kernel aware handling of the flush seems more
> > > appropriate for both hot plug and CXL reset (whether via direct
> > > flush or via FW calls from arch callbacks).
> >
> > Makes sense, and yikes "ACPI magic". My concern though as you note
> > above is the cache line immediately going back to the "Shared"
> > state from speculation before the HDM decoder space is shutdown. It
> > seems it would only be safe to invalidate sometime *after* all of
> > the page tables and HDM decode has been torn down, and suppress any
> > errors that result from unaccepted writes.
>
> I agree regarding cache flush after page table mappings removed, but
> not sure that HDM decode tear down is a requirement to prevent
> speculation. Are there architectures that can speculate to arbitrary
> PA without any PTE mappings to those PA? Would
> cxl_region_decode_reset be guaranteed to not have any page table
> mappings to the region and be a suitable place to also flush for a
> CXL reset type scenario?
> >
> > I.e. would something like this solve the immediate problem? Or does
> > the architecture need to have the address range mapped into tables
> > and decode operational for the flush to succeed?
>
> The specific implementation does not require page table mappings to
> flush caches. I'm not sure that simply suppressing error interrupts
> for any writebacks or MemClnEvct that happen after a device
> insertion/reset is good enough as devices could view that as a
> coherency error.
On an architecture that guarantees no clean writebacks (or at least none
that are ever visible, which should include this case), this shouldn't be
a problem.
So who wants to point and laugh at anyone that does clean writebacks that
can be observed?
:)
Even on archs that do allow such writebacks, I believe they are
not common, as otherwise perf would be terrible: so just let the errors
through - they are flagging errors in PAs that aren't mapped, so they should
just generate a small amount of noise in the logs.
So flush before, to make lines clean (or invalid, but then potentially
prefetched, so clean) - tear down the HDM decoders and flush again /
invalidate so nothing stale is hanging around (or do it before bringing
something new up at that Host PA). Eat or log any errors and don't worry
about it.
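Spelled out as a rough sketch (every helper name below is made up; nothing
like this exists as-is today):

/* hypothetical removal flow - illustrative ordering only */
static void region_remove_flow(struct cxl_region *cxlr)
{
	unmap_region(cxlr);		/* drop all cacheable PTE mappings */
	flush_region_caches(cxlr);	/* lines for the range now clean/invalid */
	teardown_hdm_decoders(cxlr);	/* the host PA range no longer decodes */
	flush_region_caches(cxlr);	/* drop anything refetched in between */
	/* stray writeback/evict errors past this point just get logged */
}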
Maybe I'm missing some corner cases.
Jonathan
> >
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 543c4499379e..60d1b5ecf936 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
> >  	struct cxl_region_params *p = &cxlr->params;
> >  	int i, rc = 0;
> > 
> > +	/*
> > +	 * Before the new region goes active, and while the physical address
> > +	 * range is not mapped in any page tables, invalidate any previously
> > +	 * cached lines in this physical address range.
> > +	 */
> > +	rc = cxl_region_invalidate_memregion(cxlr);
> > +	if (rc)
> > +		return rc;
> > +
> >  	for (i = 0; i < p->nr_targets; i++) {
> >  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> >  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
> >  		goto out;
> >  	}
> > 
> > -	rc = cxl_region_invalidate_memregion(cxlr);
> > -
> >  	/*
> >  	 * From this point on any path that changes the region's state away from
> >  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>
* RE: Questions about CXL device (type 3 memory) hotplug
2023-06-07 15:12 ` Jonathan Cameron
@ 2023-06-07 18:44 ` Vikram Sethi
2023-06-08 15:19 ` Jonathan Cameron
0 siblings, 1 reply; 29+ messages in thread
From: Vikram Sethi @ 2023-06-07 18:44 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse, Natu, Mahesh
> From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Sent: Wednesday, June 7, 2023 10:12 AM
> To: Vikram Sethi <vsethi@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-
> goto@fujitsu.com>; linux-cxl@vger.kernel.org; catalin.marinas@arm.com;
> James Morse <james.morse@arm.com>; Natu, Mahesh
> <mahesh.natu@intel.com>
> Subject: Re: Questions about CXL device (type 3 memory) hotplug
>
>
> On Wed, 7 Jun 2023 01:06:05 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
>
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Sent: Tuesday, June 6, 2023 3:55 PM
> > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > Vikram Sethi wrote:
> > > > Hi Dan,
> > > > Apologies for the delayed response, was out for a few days.
> > > >
> > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > > Vikram Sethi wrote:
> > > > > [..]
> > > > > > > I don't understand this failure mode. Accelerator is added,
> > > > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > > > invalidation before mapping the memory into page tables.
> > > > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > > > just snoop its caches and say
> > > > > "nothing for me to do"?
> > > > > >
> > > > > > Device's snoop filter is in a clean reset/power on state. It
> > > > > > is not tracking anything checked out by the host CPU/peer.
> > > > > > If it starts receiving writebacks or even CleanEvicts for its
> > > > > > memory,
> > > > >
> > > > > CleanEvict is a device-to-host request. We are talking about
> > > > > host-to-device requests which is only SnpData, SnpInv, and
> > > > > SnpCur,
> > > right?
> > > > >
> > > > I was referring to MemClnEvct which is a Host request to device
> > > > (M2S req) as captured in table C-3 of the latest specification
> > >
> > > Ok, thanks for that clarification.
> > >
> > > >
> > > > > > looks like an unexpected coherency message, and I know of at
> > > > > > least one implementation that triggers an error interrupt in
> > > > > > response. I don't know of a statement in the specification
> > > > > > that this is expected and implementations should ignore it. If
> > > > > > there is such a statement, could you please point me to it?
> > > > >
> > > > > All the specification says (CXL 3.0 3.2.4.4 Host to Device
> > > > > Requests) is what to do *if* the device is holding that
> > > > > cacheline.
> > > > >
> > > > > If a device fails when it gets one of those requests when it
> > > > > does not hold a line then how can this work in the nominal case
> > > > > of the device not owning any random cacheline?
> > > >
> > > > I didn't understand. The line in question is owned by the device
> > > > (it is device memory). The device has just been CXL reset or
> > > > powered up and its snoop filter isn't tracking ANY of its lines as
> > > > checked out by the host. The host tells the device it is dropping
> > > > a line that the host had checked out (MemClnEvct) but per the
> > > > device the host never checked anything out. Seems perfectly
> > > > reasonable for the device to think it is an incorrect coherency
> > > > message and flag an error. What is the nominal case that you think
> > > > is broken?
> > >
> > > The case I was considering was a broadcast / anonymous invalidation
> > > event, but now I see that MemClnEvct implies that the line was
> > > previously in the Shared / Exclusive state, so now I see your point.
> > > The host will not send MemClnEvct in the scenario I was envisioning.
> > > > >
> > > > > > Remove memory needs a cache flush IMO, in a way that prevents
> > > > > > speculative fetches. This can be done in kernel with
> > > > > > uncacheable mappings alone, if possible in the arch callback,
> > > > > > or via FW call.
> > > > >
> > > > > That assumes that the kernel owns all mappings. I worry about
> > > > > mappings that the kernel cannot see like x86 SMM. That's why
> > > > > it's currently an invalidate before next usage, but I am not
> > > > > opposed to also flushing on remove if the current solution is
> > > > > causing device-failures in
> > > practice.
> > > > >
> > > > > Can you confirm that the current kernel arrangement is causing
> > > > > failures in practice, or is this a theoretical concern? ...and
> > > > > if it is happening in practice do you have the example patch
> > > > > that fixes it?
> > > > Yes, it is causing error interrupts from the device around device
> > > > reset if the host caches are not flushed before the reset. It is
> > > > currently being worked around via ACPI magic for the cache flush
> > > > then reset, but kernel aware handling of the flush seems more
> > > > appropriate for both hot plug and CXL reset (whether via direct
> > > > flush or via FW calls from arch callbacks).
> > >
> > > Makes sense, and yikes "ACPI magic". My concern though as you note
> > > above is the cache line immediately going back to the "Shared"
> > > state from speculation before the HDM decoder space is shutdown. It
> > > seems it would only be safe to invalidate sometime *after* all of
> > > the page tables and HDM decode has been torn down, and suppress any
> > > errors that result from unaccepted writes.
> >
> > I agree regarding flushing the cache after the page table mappings are
> > removed, but I'm not sure that HDM decode teardown is a requirement to
> > prevent speculation. Are there architectures that can speculate to an
> > arbitrary PA without any PTE mappings to that PA? Would
> > cxl_region_decode_reset be guaranteed to not have any page table mappings
> > to the region, and be a suitable place to also flush for a CXL-reset type
> > scenario?
> > >
> > > I.e. would something like this solve the immediate problem? Or does
> > > the architecture need to have the address range mapped into tables
> > > and decode operational for the flush to succeed?
> >
>
> > The specific implementation does not require page table mappings to
> > flush caches. I'm not sure that simply suppressing error interrupts
> for any writebacks or MemClnEvct that happen after a device
> > insertion/reset is good enough as devices could view that as a
> > coherency error.
>
> On an architecture that guarantees no clean writebacks (or at least none that
> are ever visible, which should include this case), this shouldn't be a problem.
>
The clean drop notification (MemClnEvct) is sent to the device telling it that a clean line held by the CPU was dropped. That is the more common error condition, as I agree that most architectures won't actually write back a clean line.
> So who wants to point and laugh at anyone that does clean writebacks that
> can be observed?
> :)
>
> Even on archs that do allow such writebacks, I believe they are not
> common, as otherwise perf would be terrible: so just let the errors through -
> they are flagging errors in PAs that aren't mapped, so they should just
> generate a small amount of noise in the logs.
>
> So flush before, to make lines clean (or invalid, but then potentially
> prefetched, so clean) - tear down the HDM decoders and flush again /
> invalidate so nothing stale is hanging around (or do it before bringing
> something new up at that Host PA). Eat or log any errors and don't worry
> about it.
>
I'm OK with this approach. When the cache flush is done at the time of the decoder teardown, there mustn't be any page table mappings to the decoded HPA ranges (and if any ISA wanted to do an in-kernel flush vs. a FW call, and needed a PTE mapping for the flush, that should be done with a non-cacheable mapping).
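A sketch of that last constraint (the helper below is hypothetical; the
actual clean+invalidate instruction sequence is left arch-specific):

#include <linux/io.h>

/* Hypothetical: flush a physical range through an uncacheable alias so
 * that the act of flushing cannot itself repopulate the caches. */
static int flush_range_via_uncached_alias(phys_addr_t start, size_t size)
{
	void *alias = memremap(start, size, MEMREMAP_WC);

	if (!alias)
		return -ENOMEM;
	/* arch-specific clean+invalidate by VA over [alias, alias + size),
	 * e.g. a clflushopt loop on x86 or dc civac on arm64 */
	memunmap(alias);
	return 0;
}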
> Maybe I'm missing some corner cases.
>
> Jonathan
>
> > >
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 543c4499379e..60d1b5ecf936 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
> > >  	struct cxl_region_params *p = &cxlr->params;
> > >  	int i, rc = 0;
> > > 
> > > +	/*
> > > +	 * Before the new region goes active, and while the physical address
> > > +	 * range is not mapped in any page tables, invalidate any previously
> > > +	 * cached lines in this physical address range.
> > > +	 */
> > > +	rc = cxl_region_invalidate_memregion(cxlr);
> > > +	if (rc)
> > > +		return rc;
> > > +
> > >  	for (i = 0; i < p->nr_targets; i++) {
> > >  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > >  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
> > >  		goto out;
> > >  	}
> > > 
> > > -	rc = cxl_region_invalidate_memregion(cxlr);
> > > -
> > >  	/*
> > >  	 * From this point on any path that changes the region's state away from
> > >  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> >
* Re: Questions about CXL device (type 3 memory) hotplug
2023-06-07 18:44 ` Vikram Sethi
@ 2023-06-08 15:19 ` Jonathan Cameron
2023-06-08 18:41 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2023-06-08 15:19 UTC (permalink / raw)
To: Vikram Sethi
Cc: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse, Natu, Mahesh
On Wed, 7 Jun 2023 18:44:36 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Sent: Wednesday, June 7, 2023 10:12 AM
> > To: Vikram Sethi <vsethi@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-
> > goto@fujitsu.com>; linux-cxl@vger.kernel.org; catalin.marinas@arm.com;
> > James Morse <james.morse@arm.com>; Natu, Mahesh
> > <mahesh.natu@intel.com>
> > Subject: Re: Questions about CXL device (type 3 memory) hotplug
> >
> >
> > On Wed, 7 Jun 2023 01:06:05 +0000
> > Vikram Sethi <vsethi@nvidia.com> wrote:
> >
> > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > Sent: Tuesday, June 6, 2023 3:55 PM
> > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > Vikram Sethi wrote:
> > > > > Hi Dan,
> > > > > Apologies for the delayed response, was out for a few days.
> > > > >
> > > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > > > Vikram Sethi wrote:
> > > > > > [..]
> > > > > > > > I don't understand this failure mode. Accelerator is added,
> > > > > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > > > > invalidation before mapping the memory into page tables.
> > > > > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > > > > just snoop its caches and say
> > > > > > "nothing for me to do"?
> > > > > > >
> > > > > > > Device's snoop filter is in a clean reset/power on state. It
> > > > > > > is not tracking anything checked out by the host CPU/peer.
> > > > > > > If it starts receiving writebacks or even CleanEvicts for its
> > > > > > > memory,
> > > > > >
> > > > > > CleanEvict is a device-to-host request. We are talking about
> > > > > > host-to-device requests which is only SnpData, SnpInv, and
> > > > > > SnpCur,
> > > > right?
> > > > > >
> > > > > I was referring to MemClnEvct which is a Host request to device
> > > > > (M2S req) as captured in table C-3 of the latest specification
> > > >
> > > > Ok, thanks for that clarification.
> > > >
> > > > >
> > > > > > > looks like an unexpected coherency message, and I know of at
> > > > > > > least one implementation that triggers an error interrupt in
> > > > > > > response. I don't know of a statement in the specification
> > > > > > > that this is expected and implementations should ignore it. If
> > > > > > > there is such a statement, could you please point me to it?
> > > > > >
> > > > > > All the specification says (CXL 3.0 3.2.4.4 Host to Device
> > > > > > Requests) is what to do *if* the device is holding that
> > > > > > cacheline.
> > > > > >
> > > > > > If a device fails when it gets one of those requests when it
> > > > > > does not hold a line then how can this work in the nominal case
> > > > > > of the device not owning any random cacheline?
> > > > >
> > > > > I didn't understand. The line in question is owned by the device
> > > > > (it is device memory). The device has just been CXL reset or
> > > > > powered up and its snoop filter isn't tracking ANY of its lines as
> > > > > checked out by the host. The host tells the device it is dropping
> > > > > a line that the host had checked out (MemClnEvct) but per the
> > > > > device the host never checked anything out. Seems perfectly
> > > > > reasonable for the device to think it is an incorrect coherency
> > > > > message and flag an error. What is the nominal case that you think
> > > > > is broken?
> > > >
> > > > The case I was considering was a broadcast / anonymous invalidation
> > > > event, but now I see that MemClnEvct implies that the line was
> > > > previously in the Shared / Exclusive state, so now I see your point.
> > > > The host will not send MemClnEvct in the scenario I was envisioning.
> > > > > >
> > > > > > > Removing memory needs a cache flush IMO, in a way that
> > > > > > > prevents speculative fetches. This can be done in the kernel
> > > > > > > with uncacheable mappings alone, if possible in the arch
> > > > > > > callback, or via a FW call.
> > > > > >
> > > > > > That assumes that the kernel owns all mappings. I worry about
> > > > > > mappings that the kernel cannot see, like x86 SMM. That's why
> > > > > > it's currently an invalidate before next usage, but I am not
> > > > > > opposed to also flushing on remove if the current solution is
> > > > > > causing device failures in practice.
> > > > > >
> > > > > > Can you confirm that the current kernel arrangement is causing
> > > > > > failures in practice, or is this a theoretical concern? ...and
> > > > > > if it is happening in practice do you have the example patch
> > > > > > that fixes it?
> > > > > Yes, it is causing error interrupts from the device around device
> > > > > reset if the host caches are not flushed before the reset. It is
> > > > > currently being worked around via ACPI magic for the cache flush
> > > > > then reset, but kernel-aware handling of the flush seems more
> > > > > appropriate for both hotplug and CXL reset (whether via a direct
> > > > > flush or via FW calls from arch callbacks).
> > > >
> > > > Makes sense, and yikes, "ACPI magic". My concern, though, as you
> > > > note above, is the cache line immediately going back to the "Shared"
> > > > state from speculation before the HDM decoder space is shut down. It
> > > > seems it would only be safe to invalidate sometime *after* all of
> > > > the page tables and HDM decode have been torn down, and to suppress
> > > > any errors that result from unaccepted writes.
> > >
> > > I agree regarding the cache flush after page table mappings are removed,
> > > but I'm not sure that HDM decode teardown is a requirement to prevent
> > > speculation. Are there architectures that can speculate to an arbitrary
> > > PA without any PTE mappings to that PA? Would cxl_region_decode_reset
> > > be guaranteed to not have any page table mappings to the region, and be
> > > a suitable place to also flush for a CXL reset type scenario?
> > > >
> > > > I.e. would something like this solve the immediate problem? Or does
> > > > the architecture need to have the address range mapped into tables
> > > > and decode operational for the flush to succeed?
> > >
> >
> > > The specific implementation does not require page table mappings to
> > > flush caches. I'm not sure that simply suppressing error interrupts
> > > for any writebacks or MemClnEvct that happen after a device
> > > insertion/reset is good enough, as devices could view that as a
> > > coherency error.
> >
> > On an architecture that guarantees no clean writebacks (or at least none
> > that are ever visible, which should include this case), this shouldn't be
> > a problem.
> >
> The clean drop notification (MakeCleanEvict) is sent to the device
> telling it that a clean line held by the CPU was dropped. That is the
> more common error condition, as I agree that most architectures won't
> actually write back a clean line.
>
MemClnEvct? That's HDM-DB only, but fair enough, it can happen.
Does the device actually return an error when one of those fails to sink?
I can't recall seeing anything in the specification that says it does.
There is text for writes (dropped) and reads (complicated), but not for
the cases related to Buried State.
I thought there might be a way to convey it in a General Media Event record
(using an invalid address), but there isn't a suitable transaction type. It
would be horrible anyway, as this has nothing to do with media.
I don't think there is any other way the host can tell that the device
ignored its MemClnEvct, so no errors from that end.
> > So who wants to point and laugh at anyone who does clean writebacks
> > that can be observed?
> > :)
> >
> > Even on archs that do allow such writebacks, I believe they are not
> > common, as otherwise performance would be terrible: so just let the
> > errors through - they are flagging errors on PAs that aren't mapped,
> > so they should just generate a small amount of noise in the logs.
> >
> > So flush beforehand to make the lines clean (or invalid, though they
> > may then be prefetched back to clean), tear down the HDM decoders, and
> > flush/invalidate again so nothing stale is hanging around (or do that
> > before bringing something new up at that host PA). Eat or log any
> > errors and don't worry about it.
>
> I'm OK with this approach. When the cache flush is done at the time
> of the decoder teardown, there must not be any page table mappings to
> the decoded HPA ranges (and if any ISA wanted to do an in-kernel flush
> vs a FW call, and needed a PTE mapping for the flush, that should be
> done with a non-cacheable mapping).
FW magic so we don't have to care :)
Jonathan
>
> > Maybe I'm missing some corner cases.
> >
> > Jonathan
> >
> > > >
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 543c4499379e..60d1b5ecf936 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
> > > >  	struct cxl_region_params *p = &cxlr->params;
> > > >  	int i, rc = 0;
> > > > 
> > > > +	/*
> > > > +	 * Before the new region goes active, and while the physical address
> > > > +	 * range is not mapped in any page tables, invalidate any previous
> > > > +	 * cached lines in this physical address range.
> > > > +	 */
> > > > +	rc = cxl_region_invalidate_memregion(cxlr);
> > > > +	if (rc)
> > > > +		return rc;
> > > > +
> > > >  	for (i = 0; i < p->nr_targets; i++) {
> > > >  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > > >  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > > @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
> > > >  		goto out;
> > > >  	}
> > > > 
> > > > -	rc = cxl_region_invalidate_memregion(cxlr);
> > > > -
> > > >  	/*
> > > >  	 * From this point on any path that changes the region's state away from
> > > >  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> > >
>
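For reference, the cxl_region_invalidate_memregion() helper that the quoted
patch moves is roughly the following (paraphrased from
drivers/cxl/core/region.c around v6.4, so details may differ by kernel
version; on x86 the cpu_cache_invalidate_memregion() backend boils down to
wbinvd_on_all_cpus()):

static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
{
	if (!cpu_cache_has_invalidate_memregion()) {
		if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
			dev_warn_once(&cxlr->dev,
				      "Bypassing cache invalidation for testing!\n");
			return 0;
		}
		dev_err(&cxlr->dev, "Failed to synchronize CPU cache state\n");
		return -ENXIO;
	}

	/* Write back and invalidate every cached line in CXL address space */
	cpu_cache_invalidate_memregion(IORES_DESC_CXL);
	return 0;
}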
* Re: Questions about CXL device (type 3 memory) hotplug
2023-06-08 15:19 ` Jonathan Cameron
@ 2023-06-08 18:41 ` Dan Williams
0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2023-06-08 18:41 UTC (permalink / raw)
To: Jonathan Cameron, Vikram Sethi
Cc: Dan Williams, Yasunori Gotou (Fujitsu), linux-cxl@vger.kernel.org,
catalin.marinas@arm.com, James Morse, Natu, Mahesh
Jonathan Cameron wrote:
> On Wed, 7 Jun 2023 18:44:36 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
[..]
> > > So flush beforehand to make the lines clean (or invalid, though they
> > > may then be prefetched back to clean), tear down the HDM decoders, and
> > > flush/invalidate again so nothing stale is hanging around (or do that
> > > before bringing something new up at that host PA). Eat or log any
> > > errors and don't worry about it.
> >
> > I'm OK with this approach. When the cache flush is done at the time
> > of the decoder teardown, there must not be any page table mappings to
> > the decoded HPA ranges (and if any ISA wanted to do an in-kernel flush
> > vs a FW call, and needed a PTE mapping for the flush, that should be
> > done with a non-cacheable mapping).
>
> FW magic so we don't have to care :)
Hopefully a pre-HDM-teardown flush for draining writebacks and a
pre-HDM-setup flush for clearing out clean lines brought in by
speculation is sufficient. My worry with "FW magic" is that when it
breaks, the phone rings for kernel help.
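Sketching that ordering (cxl_region_invalidate_memregion() and
cxl_region_decode_reset() exist today, but this wrapper and its placement
are illustrative only, not current mainline behavior):

static int cxl_region_flush_and_teardown(struct cxl_region *cxlr, int count)
{
	int rc;

	/*
	 * Precondition: no page table mappings of the region's HPA range
	 * remain (memory offlined, any dax mappings torn down).
	 */

	/* Drain dirty lines while decode is still live, so that the
	 * writebacks can still reach the device. */
	rc = cxl_region_invalidate_memregion(cxlr);
	if (rc)
		return rc;

	/* Tear down HDM decode: stray accesses now master-abort instead
	 * of reaching a reset or absent device. count is the number of
	 * interleave ways, per the existing signature. */
	rc = cxl_region_decode_reset(cxlr, count);
	if (rc)
		return rc;

	/*
	 * Invalidate again to evict anything speculation pulled back in
	 * between the first flush and decode teardown; errors from
	 * writebacks that no longer have a target are logged and eaten.
	 */
	return cxl_region_invalidate_memregion(cxlr);
}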
* Re: Questions about CXL device (type 3 memory) hotplug
2023-05-23 0:11 ` Dan Williams
2023-05-23 8:31 ` Yasunori Gotou (Fujitsu)
2023-05-23 13:34 ` Vikram Sethi
@ 2024-03-27 7:10 ` Yuquan Wang
2024-03-27 7:18 ` Yuquan Wang
3 siblings, 0 replies; 29+ messages in thread
From: Yuquan Wang @ 2024-03-27 7:10 UTC (permalink / raw)
To: Dan Williams, jonathan.cameron; +Cc: linux-cxl, linux-kernel
On Mon, May 22, 2023 at 05:11:39PM -0700, Dan Williams wrote:
> Yasunori Gotou (Fujitsu) wrote:
[...]
Hi,
I ran into some confusion about CXL device hotplug when I recently
tried to use Qemu to emulate CXL device hotplug and verify the
relevant kernel functions.
> > Q1) Can PCIe hotplug driver detect and call CXL driver?
[...]
>
> Yes.
>
> The cxl_pci driver (drivers/cxl/pci.c) is just a typical PCI driver as
> far as the PCI hotplug driver is concerned. So add/remove events of a
> CXL card get turned into probe()/remove() events on the driver.
>
1. Can we divide the steps of CXL device hotplug into two parts (PCI hotplug & memory hotplug)?
PCI hotplug: the same as native PCIe hotplug, including initializing cxl.io,
assigning PCIe BARs, allocating interrupts, etc. The cxl_pci driver
is responsible for this part.
Memory hotplug: focused on enabling CXL memory, including discovering and
configuring HDM decoders, extracting NUMA info from the device, notifying
memory management, etc. (My rough mental model of this flow is sketched below.)
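A call-flow sketch of that split (function names taken from drivers/cxl
around v6.8; simplified and possibly inexact):

/*
 * Part 1 -- PCI hotplug (generic PCIe machinery):
 *   pciehp slot event
 *     -> pci_bus_add_device()
 *     -> cxl_pci_probe()                      drivers/cxl/pci.c
 *          maps register blocks, initializes the mailbox, then calls
 *          devm_cxl_add_memdev() to create memN on the cxl bus
 *
 * Part 2 -- CXL memory enabling:
 *   cxl_mem_probe()                           drivers/cxl/mem.c
 *     attaches the endpoint to the CXL port hierarchy and enumerates
 *     HDM decoders
 *   cxl create-region (userspace)
 *     -> cxl_region_probe()                   drivers/cxl/core/region.c
 *          region goes active; dax/kmem or pmem surface the capacity
 *          (e.g. via add_memory_driver_managed()) and memory blocks
 *          are onlined by udev or the user
 */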
> >
> > Q2) Can QEMU/KVM emulate CXL device hotplug?
> >
> > I heard that QEMU/KVM has PCIe device hotplug emulation, but I'm not sure
> > it can hotplug CXL device.
>
> It can, but as far as the driver is concerned you can achieve the same
> by:
>
> echo $devname > /sys/bus/pci/drivers/cxl_pci/unbind
>
> ...that exercises the same software flows as physical unplug.
>
2. What is the difference between "echo $devname > /sys/bus/pci/drivers/cxl_pci/unbind" and
"(qemu) device_del cxl-mem0"?
In my testing, I found that "(qemu) device_del cxl-mem0" would directly
unplug the device and cause interrupts on the CXL root port. It seems this
operation triggers not only the cxl_pci driver but also the pcieport driver.
The kernel dmesg is like below:
(qemu) device_del cxl-mem0
# dmesg
[ 699.057907] pcieport 0000:0c:00.0: pciehp: pending interrupts 0x0001 from Slot Status
[ 699.058929] pcieport 0000:0c:00.0: pciehp: Slot(0): Button press: will power off in 5 sec
[ 699.059986] pcieport 0000:0c:00.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 699.060099] pcieport 0000:0c:00.0: pciehp: pciehp_set_indicators: SLOTCTRL 90 write cmd 2c0
Then I also tried "echo $devname > /sys/bus/pci/drivers/cxl_pci/unbind"
to check the kernel's behaviour. The kernel dmesg is like below:
# echo 0000:0d:00.0 > /sys/bus/pci/drivers/cxl_pci/unbind
# dmesg
[70387.978931] cxl_pci 0000:0d:00.0: vgaarb: pci_notify
[70388.021476] cxl_mem mem0: disconnect mem0 from port1
[70388.033099] pci 0000:0d:00.0: vgaarb: pci_notify
It seems this operation just unbinds the cxl_pci driver from the CXL device.
Is my understanding of these two methods correct? My mental model of the
two paths is sketched below.
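In comment form (the pciehp function names are my guesses from
drivers/pci/hotplug, so possibly inexact):

/*
 * (qemu) device_del cxl-mem0              -- full hot-remove
 *   emulated attention-button press
 *     -> pciehp_ist() handles the slot interrupt, starts the
 *        5-second power-off countdown
 *     -> pciehp_disable_slot()
 *     -> pci_stop_and_remove_bus_device()
 *          -> cxl_pci remove(), then the pci_dev itself is removed
 *
 * echo $devname > .../cxl_pci/unbind      -- driver detach only
 *   device_release_driver()
 *     -> cxl_pci remove()                 (the pci_dev stays enumerated)
 */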
3. Can I just use "ndctl/test/cxl-topology.sh" to test the kernel's CXL hotplug functions?
IIUC, cxl-topology.sh utilizes cxl_test (tools/testing/cxl), which is for regression
testing the kernel-user ABI.
PS: My qemu command line:
qemu-system-x86_64 \
-M q35,nvdimm=on,cxl=on \
-m 4G \
-smp 4 \
-object memory-backend-ram,size=2G,id=mem0 \
-numa node,nodeid=0,cpus=0-1,memdev=mem0 \
-object memory-backend-ram,size=2G,id=mem1 \
-numa node,nodeid=1,cpus=2-3,memdev=mem1 \
-object memory-backend-ram,size=256M,id=cxl-mem0 \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
-hda ../disk/ubuntu_x86_test_new.qcow2 \
-nographic \
Qemu version: 8.2.50, the latest commit of branch cxl-2024-03-05 in "https://gitlab.com/jic23/qemu"
Kernel version: 6.8.0-rc6
Many thanks
Yuquan