From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Vikram Sethi <vsethi@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"catalin.marinas@arm.com" <Catalin.Marinas@arm.com>,
James Morse <james.morse@arm.com>,
"Natu, Mahesh" <mahesh.natu@intel.com>
Subject: Re: Questions about CXL device (type 3 memory) hotplug
Date: Thu, 8 Jun 2023 16:19:11 +0100 [thread overview]
Message-ID: <20230608161911.00000912@Huawei.com> (raw)
In-Reply-To: <BYAPR12MB33364B5EB908BF7239BB996BBD53A@BYAPR12MB3336.namprd12.prod.outlook.com>
On Wed, 7 Jun 2023 18:44:36 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Sent: Wednesday, June 7, 2023 10:12 AM
> > To: Vikram Sethi <vsethi@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-
> > goto@fujitsu.com>; linux-cxl@vger.kernel.org; catalin.marinas@arm.com;
> > James Morse <james.morse@arm.com>; Natu, Mahesh
> > <mahesh.natu@intel.com>
> > Subject: Re: Questions about CXL device (type 3 memory) hotplug
> >
> >
> > On Wed, 7 Jun 2023 01:06:05 +0000
> > Vikram Sethi <vsethi@nvidia.com> wrote:
> >
> > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > Sent: Tuesday, June 6, 2023 3:55 PM
> > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > Vikram Sethi wrote:
> > > > > Hi Dan,
> > > > > Apologies for the delayed response, was out for a few days.
> > > > >
> > > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > > Sent: Wednesday, May 24, 2023 4:20 PM
> > > > > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > > > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > > > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > > > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > > > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > > > Vikram Sethi wrote:
> > > > > > [..]
> > > > > > > > I don't understand this failure mode. Accelerator is added,
> > > > > > > > driver sets up an HDM decode range and triggers CPU cache
> > > > > > > > invalidation before mapping the memory into page tables.
> > > > > > > > Wouldn't the device, upon receiving an invalidation request,
> > > > > > > > just snoop its caches and say
> > > > > > "nothing for me to do"?
> > > > > > >
> > > > > > > Device's snoop filter is in a clean reset/power on state. It
> > > > > > > is not tracking anything checked out by the host CPU/peer.
> > > > > > > If it starts receiving writebacks or even CleanEvicts for its
> > > > > > > memory,
> > > > > >
> > > > > > CleanEvict is a device-to-host request. We are talking about
> > > > > > host-to-device requests which is only SnpData, SnpInv, and
> > > > > > SnpCur,
> > > > right?
> > > > > >
> > > > > I was referring to MemClnEvct, which is a Host request to device
> > > > > (M2S req), as captured in Table C-3 of the latest specification.
> > > >
> > > > Ok, thanks for that clarification.
> > > >
> > > > >
> > > > > > > looks like an unexpected coherency message and I know of at
> > > > > > > least one implementation that triggers an error interrupt in
> > > > > > > response. I don't know of a statement in the specification
> > > > > > > that this is expected and implementations should ignore. If
> > > > > > > there is such a statement, could you please point me to it?
> > > > > >
> > > > > > All the specification says (CXL 3.0 3.2.4.4 Host to Device
> > > > > > Requests) is what to do *if* the device is holding that
> > > > > > cacheline.
> > > > > >
> > > > > > If a device fails when it gets one of those requests when it
> > > > > > does not hold a line then how can this work in the nominal case
> > > > > > of the device not owning any random cacheline?
> > > > >
> > > > > I didn't understand. The line in question is owned by the device
> > > > > (it is device memory). The device has just been CXL reset or
> > > > > powered up and its snoop filter isn't tracking ANY of its lines as
> > > > > checked out by the host. The host tells the device it is dropping
> > > > > a line that the host had checked out (MemClnEvct) but per the
> > > > > device the host never checked anything out. Seems perfectly
> > > > > reasonable for the device to think it is an incorrect coherency
> > > > > message and flag an error. What is the nominal case that you think
> > > > > is broken?
> > > >
> > > > The case I was considering was a broadcast / anonymous invalidation
> > > > event, but now I see that MemClnEvct implies that the line was
> > > > previously in the Shared / Exclusive state, so now I see your point.
> > > > The host will not send MemClnEvct in the scenario I was envisioning.
> > > > > >
> > > > > > > Removing memory needs a cache flush IMO, in a way that prevents
> > > > > > > speculative fetches. This can be done in kernel with
> > > > > > > uncacheable mappings alone, if possible in the arch callback,
> > > > > > > or via FW call.
> > > > > >
> > > > > > That assumes that the kernel owns all mappings. I worry about
> > > > > > mappings that the kernel cannot see like x86 SMM. That's why
> > > > > > it's currently an invalidate before next usage, but I am not
> > > > > > opposed to also flushing on remove if the current solution is
> > > > > > causing device-failures in
> > > > practice.
> > > > > >
> > > > > > Can you confirm that the current kernel arrangement is causing
> > > > > > failures in practice, or is this a theoretical concern? ...and
> > > > > > if it is happening in practice do you have the example patch
> > > > > > that fixes it?
> > > > > Yes, it is causing error interrupts from the device around device
> > > > > reset if the host caches are not flushed before the reset. It is
> > > > > currently being worked around via ACPI magic for the cache flush
> > > > > then reset, but kernel aware handling of the flush seems more
> > > > > appropriate for both hot plug and CXL reset (whether via direct
> > > > > flush or via FW calls from arch callbacks).
> > > >
> > > > Makes sense, and yikes "ACPI magic". My concern though as you note
> > > > above is the cache line immediately going back to the "Shared"
> > > > state from speculation before the HDM decoder space is shutdown. It
> > > > seems it would only be safe to invalidate sometime *after* all of
> > > > the page tables and HDM decode has been torn down, and suppress any
> > > > errors that result from unaccepted writes.
> > >
> > > I agree regarding cache flush after page table mappings removed, but
> > > not sure that HDM decode tear down is a requirement to prevent
> > > speculation. Are there architectures that can speculate to arbitrary
> > > PA without any PTE mappings to those PA? Would
> > cxl_region_decode_reset
> > > be guaranteed to not have any page table mappings to the region and be
> > > a suitable place to also flush for a CXL reset type scenario?
> > > >
> > > > I.e. would something like this solve the immediate problem? Or does
> > > > the architecture need to have the address range mapped into tables
> > > > and decode operational for the flush to succeed?
> > >
> >
> > > The specific implementation does not require page table mappings to
> > > flush caches. I'm not sure that simply suppressing error interrupts
> > > for any writebacks or MemCleanEvict that happen after a device
> > > insertion/reset is good enough as devices could view that as a
> > > coherency error.
> >
> > On an architecture that guarantees no clean write backs (or at least none
> > that are ever visible - which should include this case) this shouldn't be
> > a problem.
> >
> The clean drop notification (MakeCleanEvict) is sent to the device
> telling it that a clean line held by the CPU was dropped. That is the
> more common error condition as I agree that most architectures won't
> actually writeback a clean line.
>
MemClnEvct? That's HDM-DB only but fair enough it can happen.
Does the device actually return an error on one of those failing to sink?
I can't recall seeing anything that says it does.
There is text for writes (dropped) and reads (complicated) but not
for the stuff related to Buried State.
I thought there might be a way to convey it in General Media Event record
(using invalid address) but there isn't a suitable transaction type. Would
be horrible anyway as this is nothing to do with media.
I don't think there is any other way the host can tell the device ignored
its MemClnEvct, so no errors from that end.
> > So who wants to point and laugh at anyone who does clean write
> > backs that can be observed?
> > :)
> >
> > Even on archs that do allow for such write backs, I believe they
> > are not common as otherwise perf would be terrible: so just let the
> > errors through - they are flagging errors in PAs that aren't mapped
> > so should just generate a small amount of noise in the logs.
> >
> > So flush before to make clean (or invalid but then potentially
> > prefetched so clean) - tear down the HDM decoders and flush again /
> > invalidate so nothing stale is hanging around (or do it before
> > bringing something new up at that Host PA). Eat or log any errors
> > and don't worry about it.
>
> I'm OK with this approach. When the cache flush is done at the time
> of the decoder tear down, there mustn't be any page table mappings to
> the decode HPA ranges (and if any ISA wanted to do an in kernel flush
> vs FW call, and needed a PTE mapping for the flush, that should be
> done with a non cacheable mapping).
FW magic so we don't have to care :)
Jonathan
>
> > Maybe I'm missing some corner cases.
> >
> > Jonathan
> >
> > > >
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 543c4499379e..60d1b5ecf936 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -187,6 +187,15 @@ static int cxl_region_decode_commit(struct cxl_region *cxlr)
> > > >  	struct cxl_region_params *p = &cxlr->params;
> > > >  	int i, rc = 0;
> > > >  
> > > > +	/*
> > > > +	 * Before the new region goes active, and while the physical address
> > > > +	 * range is not mapped in any page tables invalidate any previous cached
> > > > +	 * lines in this physical address range.
> > > > +	 */
> > > > +	rc = cxl_region_invalidate_memregion(cxlr);
> > > > +	if (rc)
> > > > +		return rc;
> > > > +
> > > >  	for (i = 0; i < p->nr_targets; i++) {
> > > >  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > > >  		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > > @@ -3158,8 +3167,6 @@ static int cxl_region_probe(struct device *dev)
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	rc = cxl_region_invalidate_memregion(cxlr);
> > > > -
> > > >  	/*
> > > >  	 * From this point on any path that changes the region's state away from
> > > >  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>