Linux CXL
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: "Bowman, Terry" <terry.bowman@amd.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-pci@vger.kernel.org>, <nifan.cxl@gmail.com>,
	<dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<dan.j.williams@intel.com>, <bhelgaas@google.com>,
	<mahesh@linux.ibm.com>, <ira.weiny@intel.com>, <oohall@gmail.com>,
	<Benjamin.Cheatham@amd.com>, <rrichter@amd.com>,
	<nathan.fontenot@amd.com>,
	<Smita.KoralahalliChannabasappa@amd.com>, <lukas@wunner.de>,
	<ming.li@zohomail.com>, <PradeepVineshReddy.Kodamati@amd.com>
Subject: Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
Date: Wed, 21 May 2025 19:34:12 +0100	[thread overview]
Message-ID: <20250521193412.000067b3@huawei.com> (raw)
In-Reply-To: <46c962de-a691-4e67-9af6-5765225633ae@amd.com>

On Tue, 20 May 2025 08:21:18 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 5/20/2025 6:04 AM, Jonathan Cameron wrote:
> > On Thu, 15 May 2025 16:52:15 -0500
> > "Bowman, Terry" <terry.bowman@amd.com> wrote:
> >  
> >> On 4/25/2025 8:18 AM, Jonathan Cameron wrote:  
> >>> On Thu, 24 Apr 2025 09:17:45 -0500
> >>> "Bowman, Terry" <terry.bowman@amd.com> wrote:
> >>>    
> >>>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:    
> >>>>> On Wed, 26 Mar 2025 20:47:05 -0500
> >>>>> Terry Bowman <terry.bowman@amd.com> wrote:
> >>>>>      
> >>>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
> >>>>>> CXL errors to the CXL driver. However, the forwarding functionality is
> >>>>>> currently unimplemented. Update the AER driver to enable error forwarding
> >>>>>> to the CXL driver.
> >>>>>>
> >>>>>> Modify the AER service driver's handle_error_source(), which is called from
> >>>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> >>>>>>
> >>>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> >>>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> >>>>>> masks.
> >>>>>>
> >>>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> >>>>>> as done in the current AER driver.
> >>>>>>
> >>>>>> If the error is a CXL-related error then forward it to the CXL driver for
> >>>>>> handling using the kfifo mechanism.
> >>>>>>
> >>>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
> >>>>>> protocol error context using cxl_create_prot_err_info(). This context is
> >>>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> >>>>>>
> >>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>      
> >>>>> Hi Terry,
> >>>>>
> >>>>> Finally got back to this.  I'm not following how some of the reference
> >>>>> counting in here is working.  It might be fine but there is a lot
> >>>>> taking then dropping device references - some of which are taken again later.
> >>>>>      
> >>>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> >>>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
> >>>>>>  }
> >>>>>>  
> >>>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> >>>>>> +{
> >>>>>> +	int severity = info->severity;      
> >>>>> So far this variable isn't really justified.  Maybe it makes sense later in the
> >>>>> series?      
> >>>> This is used below in call to cxl_create_prot_err_info().    
> >>> Sure, but why not just do
> >>>
> >>> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
> >>>
> >>> There isn't anything modifying info->severity in between so that local
> >>> variable is just padding out the code to no real benefit.
> >>>
> >>>    
> >>>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
> >>>>>> +		return;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);      
> >>>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
> >>>>> followed by retaking it here.  How do we know it is still about by this call
> >>>>> and once we pull it off the kfifo later?      
> >>>> Yes, this is a problem I realized after sending the series.
> >>>>
> >>>> The device reference incr could be changed for all the devices to the non-cleanup
> >>>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
> >>>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
> >>>>
> >>>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
> >>>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
> >>>> fifo. Thoughts?    
> >>> That sounds like a reasonable solution to me.
> >>>
> >>> Jonathan    
> >> Hi Jonathan,  
> > Hi Terry,
> >
> > Sorry for delay - travel etc...
> >  
> >> Is it sufficient to rely on correctly implemented reference counting implementation instead
> >> of caching the RAS status I mentioned earlier?
> >>
> >> I have the next revision coded to 'get' the CXL erring device's reference count in the AER
> >> driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
> >> after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
> >> the RAS status and caching it. One more question: is it OK to implement the get and put (of
> >> the same object) in different drivers?  
> > It's definitely unusual.  If there is anything similar to point at I'd be happier than
> > this 'innovation' showing up here first. 
> >  
> >> If we need to read and cache the RAS status before the kfifo enqueue there will be some other
> >> details to work through.  
> > This still smells like the cleaner solution to me, but depends on those details..
> >
> > Jonathan  
> 
> In this case I believe we will need to move the CE handling (RAS status reading and clearing) before
> the kfifo enqueue. I think this is necessary because CXL errors may continue to be received and we
> don't want their status's combined when reading or clearing. I can refactor cxl_handle_ras()/
> cxl_handle_cor_ras() to return the RAS status value and remove the trace logging (to instead be
> called after kfifo dequeue).
> 
> This leaves the UCE case. It's worth mentioning the UCE flow is different than the the CE case
> because it uses the top-bottom traversal starting at the erring device. Correct me if I'm wrong
> this would be handled before the kfifo as well. The handling and logging in the UCE case are
> baked together. The UCE flow would therefore need to include the trace logging during handling.
> 
> Another flow is the PCI EP errors. The PCIe EP CE and UCE handlers remain and can call the
> the refactored cxl_handle_ras()/cxl_handle_cor_ras() and then trace log afterwards. This is no
> issue.
> 
> This leaves only CE trace logging to be called after the kfifo dequeue. This is what doesn't
> feel right and wanted to draw attention to.
> 
> All this to say: very little work will be done after the kfifo dequeue. Most of the work in
> the kfifo implementation would be before the kfifo enqueuing in the CXL create_prot_error_info()
> callback. I am concerned the balance of work done before and after the kfifo enqueue/dequeue
> will be very asymmetric with little value provided from the kfifo.
> 
As per the discord chat - if you look up the device again from BDF or similar and get this
info once you have right locks post kfifo all should be fine as any race will be easy
to resolve by doing nothing if the driver has gone away.

Jonathan

> -Terry
> 


  reply	other threads:[~2025-05-21 18:34 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
2025-03-27 15:11   ` Ira Weiny
2025-03-27 15:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-03-27 16:48   ` Bjorn Helgaas
2025-03-27 17:15     ` Bowman, Terry
2025-03-27 17:49       ` Bjorn Helgaas
2025-03-27 16:58   ` Ira Weiny
2025-03-27 17:17     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
2025-03-27 17:08   ` Bjorn Helgaas
2025-03-27 18:12     ` Bowman, Terry
2025-03-28 17:02       ` Bjorn Helgaas
2025-03-28 17:36         ` Bowman, Terry
2025-03-28 17:01   ` Ira Weiny
2025-04-07 13:43     ` Bowman, Terry
2025-04-04 16:53   ` Jonathan Cameron
2025-04-23 14:33   ` Jonathan Cameron
2025-04-23 15:04   ` Jonathan Cameron
2025-04-23 22:12   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
2025-03-27 17:13   ` Bjorn Helgaas
2025-04-07 14:00     ` Bowman, Terry
2025-04-23 15:04   ` Jonathan Cameron
2025-04-24 14:17     ` Bowman, Terry
2025-04-25 13:18       ` Jonathan Cameron
2025-04-25 21:03         ` Bowman, Terry
2025-05-15 21:52         ` Bowman, Terry
2025-05-20 11:04           ` Jonathan Cameron
2025-05-20 13:21             ` Bowman, Terry
2025-05-21 18:34               ` Jonathan Cameron [this message]
2025-05-21 23:30                 ` Bowman, Terry
2025-04-23 22:21   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
2025-03-27  4:43   ` kernel test robot
2025-04-23 16:28   ` Jonathan Cameron
2025-04-24 15:03     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
2025-03-27  3:37   ` kernel test robot
2025-03-27  4:19   ` kernel test robot
2025-04-23 16:35   ` Jonathan Cameron
2025-04-24 14:22     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
2025-04-17 10:18   ` Jonathan Cameron
2025-04-24 14:25     ` Bowman, Terry
2025-05-12 14:47     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-03-27  1:47 ` [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
2025-04-23 16:41   ` Jonathan Cameron
2025-04-24 14:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-04-23 16:44   ` Jonathan Cameron
2025-05-07 16:28     ` Shiju Jose
2025-05-07 18:30       ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
2025-04-23 16:47   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
2025-03-27 19:46   ` kernel test robot
2025-04-23 16:49   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-04-17 17:22   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-04-04 17:05   ` Jonathan Cameron
2025-04-07 14:34     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
2025-03-28  1:18   ` kernel test robot
2025-04-04 17:04   ` Jonathan Cameron
2025-04-07 14:25     ` Bowman, Terry
2025-04-17 10:13       ` Jonathan Cameron
2025-04-24 16:37         ` Bowman, Terry
2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
2025-03-27 22:04   ` Bowman, Terry
2025-05-06 23:06 ` Gregory Price
2025-05-07 18:28   ` Bowman, Terry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250521193412.000067b3@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=ming.li@zohomail.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nifan.cxl@gmail.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox