Linux CXL
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: "Bowman, Terry" <terry.bowman@amd.com>
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, nifan.cxl@gmail.com,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, dan.j.williams@intel.com,
	bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
	oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com,
	lukas@wunner.de, ming.li@zohomail.com,
	PradeepVineshReddy.Kodamati@amd.com
Subject: Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
Date: Fri, 28 Mar 2025 12:02:12 -0500	[thread overview]
Message-ID: <20250328170212.GA1508786@bhelgaas> (raw)
In-Reply-To: <9ae4740d-e2d9-4d49-b021-6712311842ed@amd.com>

What does this series apply to?  I default to the current -rc1
(v6.14-rc1), but this doesn't apply there, and I don't have the
base-commit: aae0594a7053c60b82621136257c8b648c67b512 mentioned in the
cover letter.

Sometimes things make more sense when I can see everything as applied.

On Thu, Mar 27, 2025 at 01:12:30PM -0500, Bowman, Terry wrote:
> On 3/27/2025 12:08 PM, Bjorn Helgaas wrote:
> > On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt details with CXL driver. The notification is required for
> >> the CXL drivers to then handle CXL RAS errors.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work. The cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions allowing CXL driver to access. Implement the registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
> >> now. The CXL specific handling will be added in future patch.
> >>
> >> Introduce 'struct cxl_prot_err_info'. This structure caches CXL error
> >> details used in completing error handling. This avoid duplicating some
> >> function calls and allows the error to be treated generically when
> >> possible.

> >> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> >> +			     struct cxl_prot_error_info *err_info)
> >> +{
> ...

> >> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> >> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> >> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> >> +		return -ENODEV;
> >
> > Similar.  A pci_warn_once() here seems like a debugging aid during
> > development, not necessarily a production kind of thing.
> >
> > Thanks for printing the type.  I would use "%#x" to make it clear that
> > it's hex.  There are about 1900 %X uses compared with 33K
> > %x uses, but maybe you have a reason to capitalize it?
>
> Got it "%x". Would you recommend the pci_warn_once() is removed?

The dependency on pdev being an endpoint is not clear here, so I would
just remove the check altogether or move it to the place that breaks
if pdev is not an endpoint.

> >> +#if defined(CONFIG_PCIEAER_CXL)
> >> +int cxl_register_prot_err_work(struct work_struct *work,
> >> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
> >> +								struct cxl_prot_error_info*))
> >
> > Ditto.  Rewrap to fit in 80 columns, unindent this function
> > pointer decl to make it fit.  Same below in aer.h.
>
> Ok, got it. Without using typedefs, right ?

A typedef would be fine with me.

> >> +struct cxl_prot_error_info {
> >> +	struct pci_dev *pdev;
> >> +	struct device *dev;
> >> +	void __iomem *ras_base;
> >> +	int severity;
> >
> > What does the "prot" in "cxl_prot_error_info" refer to?
>
> Protocol. As in CXL Protocol Error Information. I suppose it needs
> to be renamed if it wasn't obvious.

Unless there are CXL non-protocol errors that need to be
distinguished, I would just omit "prot" altogether.

> > There's basically no error info here other than "severity".
>
> Correct. It's more accurately "CXL Protocol Error Context" but I didn't
> want to re-use 'context' because 'context' is used for thread/process
> statefulness. Also, I followed the existing CPER parallel work that uses
> a similar kfifo etc. Thoughts on rename?

What's the name of the corresponding CPER struct?

> > I guess "dev" and "pdev" are separate devices (otherwise you would
> > just use "&pdev->dev"), but I don't have any intuition about how they
> > might be related, which is a little disconcerting.
>
> "pdev" represents a PCIe device: RP, USP, DSP, or EP.  "dev" is the
> same device as "pdev" but "dev" is found in CXL topology. "dev" is
> discovered through ACPI/platform enumeration and interconnected with
> other CXL "devs" using upstream and downstream links. Moving back
> and forth between "pdev" and its CXL "dev" requires a search unique
> to the device type and point beginning the search.

It seems weird to me to have two device pointers here.  Seems like we
should use a single pointer to identify the device, and if we need to
get from PCI to CXL or vice versa, there should be a pointer somewhere
so we don't have to search all the time.

> > I would have thought that "ras_base" would be a property of "dev"
> > (the CXL device) and wouldn't need to be separate.
>
> "ras_base" is a common member of the CXL Port, CXL Downstream Port,
> CXL Upstream Port, and CXL EP. If one wants the "ras_base" for a
> given CXL "dev" then the "dev" must be converted to CXL Port,
> Downstream Port, or Upstream Port.

Passing around ras_base seems dodgy to me.  I think it's better to
pass around a real entity like a pci_dev or cxl_port or cxl_dport or
whatever.  Code that needs to deal with ras_base presumably should
know about the internals of the device ras_base belongs to.

Bjorn

  reply	other threads:[~2025-03-28 17:02 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
2025-03-27 15:11   ` Ira Weiny
2025-03-27 15:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-03-27 16:48   ` Bjorn Helgaas
2025-03-27 17:15     ` Bowman, Terry
2025-03-27 17:49       ` Bjorn Helgaas
2025-03-27 16:58   ` Ira Weiny
2025-03-27 17:17     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
2025-03-27 17:08   ` Bjorn Helgaas
2025-03-27 18:12     ` Bowman, Terry
2025-03-28 17:02       ` Bjorn Helgaas [this message]
2025-03-28 17:36         ` Bowman, Terry
2025-03-28 17:01   ` Ira Weiny
2025-04-07 13:43     ` Bowman, Terry
2025-04-04 16:53   ` Jonathan Cameron
2025-04-23 14:33   ` Jonathan Cameron
2025-04-23 15:04   ` Jonathan Cameron
2025-04-23 22:12   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
2025-03-27 17:13   ` Bjorn Helgaas
2025-04-07 14:00     ` Bowman, Terry
2025-04-23 15:04   ` Jonathan Cameron
2025-04-24 14:17     ` Bowman, Terry
2025-04-25 13:18       ` Jonathan Cameron
2025-04-25 21:03         ` Bowman, Terry
2025-05-15 21:52         ` Bowman, Terry
2025-05-20 11:04           ` Jonathan Cameron
2025-05-20 13:21             ` Bowman, Terry
2025-05-21 18:34               ` Jonathan Cameron
2025-05-21 23:30                 ` Bowman, Terry
2025-04-23 22:21   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
2025-03-27  4:43   ` kernel test robot
2025-04-23 16:28   ` Jonathan Cameron
2025-04-24 15:03     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
2025-03-27  3:37   ` kernel test robot
2025-03-27  4:19   ` kernel test robot
2025-04-23 16:35   ` Jonathan Cameron
2025-04-24 14:22     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
2025-04-17 10:18   ` Jonathan Cameron
2025-04-24 14:25     ` Bowman, Terry
2025-05-12 14:47     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-03-27  1:47 ` [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
2025-04-23 16:41   ` Jonathan Cameron
2025-04-24 14:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-04-23 16:44   ` Jonathan Cameron
2025-05-07 16:28     ` Shiju Jose
2025-05-07 18:30       ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
2025-04-23 16:47   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
2025-03-27 19:46   ` kernel test robot
2025-04-23 16:49   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-04-17 17:22   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-04-04 17:05   ` Jonathan Cameron
2025-04-07 14:34     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
2025-03-28  1:18   ` kernel test robot
2025-04-04 17:04   ` Jonathan Cameron
2025-04-07 14:25     ` Bowman, Terry
2025-04-17 10:13       ` Jonathan Cameron
2025-04-24 16:37         ` Bowman, Terry
2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
2025-03-27 22:04   ` Bowman, Terry
2025-05-06 23:06 ` Gregory Price
2025-05-07 18:28   ` Bowman, Terry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250328170212.GA1508786@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=ming.li@zohomail.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nifan.cxl@gmail.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox