From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Vikram Sethi <vsethi@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"catalin.marinas@arm.com" <Catalin.Marinas@arm.com>,
James Morse <james.morse@arm.com>,
"Natu, Mahesh" <mahesh.natu@intel.com>
Subject: Re: Questions about CXL device (type 3 memory) hotplug
Date: Thu, 8 Jun 2023 16:19:11 +0100 [thread overview]
Message-ID: <20230608161911.00000912@Huawei.com> (raw)
In-Reply-To: <BYAPR12MB33364B5EB908BF7239BB996BBD53A@BYAPR12MB3336.namprd12.prod.outlook.com>
On Wed, 7 Jun 2023 18:44:36 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Sent: Wednesday, June 7, 2023 10:12 AM
> > To: Vikram Sethi <vsethi@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-
> > goto@fujitsu.com>; linux-cxl@vger.kernel.org; catalin.marinas@arm.com;
> > James Morse <james.morse@arm.com>; Natu, Mahesh
> > <mahesh.natu@intel.com>
> > Subject: Re: Questions about CXL device (type 3 memory) hotplug
> >
> >
> > On Wed, 7 Jun 2023 01:06:05 +0000
> > Vikram Sethi <vsethi@nvidia.com> wrote:
> >
> > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > Sent: Tuesday, June 6, 2023 3:55 PM
> > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > Vikram Sethi wrote:
> > > > > Hi Dan,
> > > > > Apologies for the delayed response, was out for a few days.
> > > > >
> > > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > > > Vikram Sethi wrote:
> > > > > > [..]
> > > > > > > > I don't understand this failure mode. Accelerator is added,
> > > > > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > > > > invalidation before mapping the memory into page tables.
> > > > > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > > > > just snoop its caches and say
> > > > > > "nothing for me to do"?
> > > > > > >
> > > > > > > Device's snoop filter is in a clean reset/power on state. It
> > > > > > > is not tracking anything checked out by the host CPU/peer.
> > > > > > > If it starts receiving writebacks or even CleanEvicts for its
> > > > > > > memory,
> > > > > >
> > > > > > CleanEvict is a device-to-host request. We are talking about
> > > > > > host-to-device requests which is only SnpData, SnpInv, and
> > > > > > SnpCur,
> > > > right?
> > > > > >
> > > > > I was referring to MemClnEvct, which is a Host request to device
> > > > > (M2S req), as captured in Table C-3 of the latest specification.
> > > >
> > > > Ok, thanks for that clarification.
> > > >
> > > > >
> > > > > > > looks like an unexpected coherency message and I know of at
> > > > > > > least one implementation that triggers an error interrupt in
> > > > > > > response. I don't know of a statement in the specification
> > > > > > > that this is expected and implementations should ignore. If
> > > > > > > there is such a statement, could you please point me to it?
> > > > > >
> > > > > > All the specification says (CXL 3.0 3.2.4.4 Host to Device
> > > > > > Requests) is what to do *if* the device is holding that
> > > > > > cacheline.
> > > > > >
> > > > > > If a device fails when it gets one of those requests when it
> > > > > > does not hold a line then how can this work in the nominal case
> > > > > > of the device not owning any random cacheline?
> > > > >
> > > > > I didn't understand. The line in question is owned by the device
> > > > > (it is device memory). The device has just been CXL reset or
> > > > > powered up and its snoop filter isn't tracking ANY of its lines as
> > > > > checked out by the host. The host tells the device it is dropping
> > > > > a line that the host had checked out (MemClnEvct) but per the
> > > > > device the host never checked anything out. Seems perfectly
> > > > > reasonable for the device to think it is an incorrect coherency
> > > > > message and flag an error. What is the nominal case that you think
> > > > > is broken?
> > > >
> > > > The case I was considering was a broadcast / anonymous invalidation
> > > > event, but now I see that MemClnEvct implies that the line was
> > > > previously in the Shared / Exclusive state, so now I see your point.
> > > > The host will not send MemClnEvct in the scenario I was envisioning.
> > > > > >
> > > > > > > Removing memory needs a cache flush IMO, in a way that prevents
> > > > > > > speculative fetches. This can be done in kernel with
> > > > > > > uncacheable mappings alone, if possible in the arch callback,
> > > > > > > or via FW call.
> > > > > >
> > > > > > That assumes that the kernel owns all mappings. I worry about
> > > > > > mappings that the kernel cannot see like x86 SMM. That's why
> > > > > > it's currently an invalidate before next usage, but I am not
> > > > > > opposed to also flushing on remove if the current solution is
> > > > > > causing device-failures in
> > > > practice.
> > > > > >
> > > > > > Can you confirm that the current kernel arrangement is causing
> > > > > > failures in practice, or is this a theoretical concern? ...and
> > > > > > if it is happening in practice do you have the example patch
> > > > > > that fixes it?
> > > > > Yes, it is causing error interrupts from the device around device
> > > > > reset if the host caches are not flushed before the reset. It is
> > > > > currently being worked around via ACPI magic for the cache flush
> > > > > then reset, but kernel aware handling of the flush seems more
> > > > > appropriate for both hot plug and CXL reset (whether via direct
> > > > > flush or via FW calls from arch callbacks).
> > > >
> > > > Makes sense, and yikes "ACPI magic". My concern though as you note
> > > > above is the cache line immediately going back to the "Shared"
> > > > state from speculation before the HDM decoder space is shutdown. It
> > > > seems it would only be safe to invalidate sometime *after* all of
> > > > the page tables and HDM decode has been torn down, and suppress any
> > > > errors that result from unaccepted writes.
> > >
> > > I agree regarding cache flush after page table mappings removed, but
> > > not sure that HDM decode tear down is a requirement to prevent
> > > speculation. Are there architectures that can speculate to arbitrary
> > > PA without any PTE mappings to those PA? Would
> > cxl_region_decode_reset
> > > be guaranteed to not have any page table mappings to the region and be
> > > a suitable place to also flush for a CXL reset type scenario?
> > > >
> > > > I.e. would something like this solve the immediate problem? Or does
> > > > the architecture need to have the address range mapped into tables
> > > > and decode operational for the flush to succeed?
> > >
> >
> > > The specific implementation does not require page table mappings to
> > > flush caches. I'm not sure that simply suppressing error interrupts
> > > for any writebacks or MemCleanEvict that happen after a device
> > > insertion/reset is good enough as devices could view that as a
> > > coherency error.
> >
> > On an architecture that guarantees no clean write backs (or at least none
> > that are ever visible - which should include this case) this shouldn't be
> > a problem.
> >
> The clean drop notification (MakeCleanEvict) is sent to the device
> telling it that a clean line held by the CPU was dropped. That is the
> more common error condition as I agree that most architectures won't
> actually writeback a clean line.
>
MemClnEvct? That's HDM-DB only but fair enough it can happen.
Does the device actually return an error on one of those failing to sink?
I can't recall seeing anything that says it does.
There is text for writes (dropped) and reads (complicated) but not
for the stuff related to Buried State.
I thought there might be a way to convey it in General Media Event record
(using invalid address) but there isn't a suitable transaction type. Would
be horrible anyway as this is nothing to do with media.
I don't think there is any other way the host can tell the device ignored
its MemClnEvct, so no errors from that end.
> > So who wants to point and laugh at anyone who does clean write
> > backs that can be observed?
> > :)
> >
> > Even on archs that do allow for such write backs, I believe they
> > are not common as otherwise perf would be terrible: so just let the
> > errors through - they are flagging errors in PAs that aren't mapped
> > so should just generate a small amount of noise in the logs.
> >
> > So flush before to make clean (or invalid but then potentially
> > prefetched so clean) - tear down the HDM decoders and flush again /
> > invalidate so nothing stale is hanging around (or do it before
> > bringing something new up at that Host PA). Eat or log any errors
> > and don't worry about it.
>
> I'm OK with this approach. When the cache flush is done at the time
> of the decoder tear down, there mustn't be any page table mappings to
> the decode HPA ranges (and if any ISA wanted to do an in kernel flush
> vs FW call, and needed a PTE mapping for the flush, that should be
> done with a non cacheable mapping).
FW magic so we don't have to care :)
Jonathan
>
> > Maybe I'm missing some corner cases.
> >
> > Jonathan
> >
> > > >
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 543c4499379e..60d1b5ecf936 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
> > > >  	struct cxl_region_params *p = &cxlr->params;
> > > >  	int i, rc = 0;
> > > >  
> > > > +	/*
> > > > +	 * Before the new region goes active, and while the physical address
> > > > +	 * range is not mapped in any page tables invalidate any previous cached
> > > > +	 * lines in this physical address range.
> > > > +	 */
> > > > +	rc = cxl_region_invalidate_memregion(cxlr);
> > > > +	if (rc)
> > > > +		return rc;
> > > > +
> > > >  	for (i = 0; i < p->nr_targets; i++) {
> > > >  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > > >  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > > @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	rc = cxl_region_invalidate_memregion(cxlr);
> > > > -
> > > >  	/*
> > > >  	 * From this point on any path that changes the region's state away from
> > > >  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>