From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dan.j.williams@intel.com>, <ira.weiny@intel.com>,
<dave@stgolabs.net>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <ming4.li@intel.com>,
<vishal.l.verma@intel.com>, <jim.harris@samsung.com>,
<ilpo.jarvinen@linux.intel.com>, <ardb@kernel.org>,
<sathyanarayanan.kuppuswamy@linux.intel.com>,
<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<Yazen.Ghannam@amd.com>, <Robert.Richter@amd.com>
Subject: Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
Date: Thu, 20 Jun 2024 14:09:26 +0100 [thread overview]
Message-ID: <20240620140926.000029d2@Huawei.com> (raw)
In-Reply-To: <20240617200411.1426554-8-terry.bowman@amd.com>
On Mon, 17 Jun 2024 15:04:09 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL root ports, CXL downstream switch ports, and CXL upstream switch
> ports are bound to the PCIe port bus driver, portdrv. portdrv provides
> an atomic notifier chain for reporting PCIe port device AER
> correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE).
>
> CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]
>
> Add a cxl_pci atomic notification callback for handling the portdrv's
> AER UIE/CIE notifications.
>
> Register the atomic notification callback in the cxl_pci module's
> load. Unregister the callback in the cxl_pci driver's unload.
>
> Implement the callback to check if the device parameter is a valid
> CXL PCIe port. If it is valid then make the notification callback call
> __cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
> type.
>
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,
Some comments inline. Mostly this comes down to earlier question of whether
this notifier should be registered per device or globally.
I think I still prefer per device, but attaching the handler will be trickier
and I'm guessing there may be some locking/lifetime issues doing that which
are avoided by a global notifier.
Jonathan
> ---
> drivers/cxl/core/core.h | 4 ++
> drivers/cxl/core/pci.c | 97 ++++++++++++++++++++++++++++++++++++++---
> drivers/cxl/core/port.c | 6 +--
> drivers/cxl/cxl.h | 5 +++
> drivers/cxl/cxlpci.h | 2 +
> drivers/cxl/pci.c | 19 +++++++-
> 6 files changed, 123 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index bc5a95665aa0..69bef1db6ee0 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
> enum access_coordinate_class access);
> bool cxl_need_node_perf_attrs_update(int nid);
>
> +struct cxl_dport *find_dport(struct cxl_port *port, int id);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> + struct cxl_dport **dport);
> +
> #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 59a317ab84bb..e630eccb733d 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
> static void __cxl_handle_cor_ras(struct device *dev,
> void __iomem *ras_base)
> {
> - struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> void __iomem *addr;
> u32 status;
>
> @@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
>
> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
Blank line probably not wanted as we want to group the status
check with the read (it's kind of an error check).
> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> + return;
> +
> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (is_cxl_memdev(dev)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> trace_cxl_aer_correctable_error(cxlmd, status);
As below - don't bother with local cxlmd variable.
> - }
> + } else if (dev_is_pci(dev))
> + trace_cxl_port_aer_correctable_error(dev, status);
> }
>
> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
> static bool __cxl_handle_ras(struct device *dev,
> void __iomem *ras_base)
> {
> - struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> u32 hl[CXL_HEADERLOG_SIZE_U32];
> void __iomem *addr;
> u32 status;
> @@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> + if (is_cxl_memdev(dev)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
Just use this inline.
trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev),
status, fe, hl);
> +
> + trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> + } else if (dev_is_pci(dev))
> + trace_cxl_port_aer_uncorrectable_error(dev, status);
As before, why no fe or hl? I'm sure I'm missing some spec subtlty
but a comment would help me and others avoid that.
> +
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
> @@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
> return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
> }
>
> +static int match_uport(struct device *dev, void *data)
> +{
> + struct device *uport_dev = (struct device *)data;
> + struct cxl_port *port;
> +
> + if (!is_cxl_port(dev))
> + return 0;
> +
> + port = to_cxl_port(dev);
> +
> + return (port->uport_dev == uport_dev);
() doesn't add much so I'd drop them.
> +}
> +
> +static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
> +{
> + struct cxl_dport *dport;
> + struct device *port_dev;
> + struct cxl_port *port;
> +
> + port = find_cxl_port(pdev->dev.parent, &dport);
> + if (!port)
> + return NULL;
> + put_device(&port->dev);
I'm confused on the lifetimes. Doesn't it make more sense
to hold this until after you've stopped using it? So move the
put after device_find_child(...)
> +
> + port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
> + if (!port_dev)
> + return NULL;
> + put_device(port_dev);
> +
> + port = to_cxl_port(port_dev);
> +
> + return port;
return to_cxl_port(port_dev);
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> + void __iomem *ras_base = NULL;
Don't initialize and...
> +
> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> + struct cxl_dport *dport;
> +
> + find_cxl_port(&pdev->dev, &dport);
> + ras_base = dport ? dport->regs.ras : NULL;
if (dport)
return dport->regs.ras;
> + } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> + struct cxl_port *port = pci_to_cxl_uport(pdev);
> +
> + ras_base = port ? port->regs.ras : NULL;
if (port)
return port->regs.ras;
> + }
return NULL;
> +
> + return ras_base;
> +}
> +
> +int port_internal_err_cb(struct notifier_block *unused,
> + unsigned long event, void *ptr)
> +{
> + struct pci_dev *pdev = (struct pci_dev *)ptr;
I think you can use this notifier for other types of device in future?
If it's going to be global definitely want to check here that we
actually have a CXL port of some type.
It may be that via the various calls any non CXL device
will result in a safe error. However that's not obvious, so an
upfront check makes sense (or a per device notifier registration!)
> + void __iomem *ras_base;
> +
> + if (!pdev)
> + return 0;
> +
> + if (event == AER_CORRECTABLE) {
> + ras_base = cxl_pci_port_ras(pdev);
> + __cxl_handle_cor_ras(&pdev->dev, ras_base);
as below - one line should be fine for this.
> + } else if ((event == AER_FATAL) || (event == AER_NONFATAL)) {
> + ras_base = cxl_pci_port_ras(pdev);
> + __cxl_handle_ras(&pdev->dev, ras_base);
__cxl_handle_ras(&pdev->dev, cxl_pci_port_ras(pdev));
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(port_internal_err_cb, CXL);
> +
> /*
> * Copy the AER capability registers using 32 bit read accesses.
> * This is necessary because RCRB AER capability is MMIO mapped. Clear the
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 887ed6e358fb..d0f95c965ab4 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1027,7 +1027,7 @@ void put_cxl_root(struct cxl_root *cxl_root)
> }
> EXPORT_SYMBOL_NS_GPL(put_cxl_root, CXL);
>
> -static struct cxl_dport *find_dport(struct cxl_port *port, int id)
> +struct cxl_dport *find_dport(struct cxl_port *port, int id)
> {
> struct cxl_dport *dport;
> unsigned long index;
> @@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
> return NULL;
> }
>
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> - struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> + struct cxl_dport **dport)
> {
> struct cxl_find_port_ctx ctx = {
> .dport_dev = dport_dev,
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7cee678fdb75..04725344393b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
> #include <linux/log2.h>
> #include <linux/node.h>
> #include <linux/io.h>
> +#include "../pci/pcie/portdrv.h"
Why add the include? Maybe only needed in c files/
>
> /**
> * DOC: cxl objects
> @@ -760,11 +761,15 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
> #ifdef CONFIG_PCIEAER_CXL
> void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
> void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
> +int port_internal_err_cb(struct notifier_block *unused,
> + unsigned long event, void *ptr);
> #else
> static inline void cxl_setup_parent_dport(struct device *host,
> struct cxl_dport *dport) { }
> static inline void cxl_setup_parent_uport(struct device *host,
> struct cxl_port *port) { }
> +static inline int port_internal_err_cb(struct notifier_block *unused,
> + unsigned long event, void *ptr) { return 0; }
> #endif
>
> struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 93992a1c8eec..6044955e1c48 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -130,4 +130,6 @@ void read_cdat_data(struct cxl_port *port);
> void cxl_cor_error_detected(struct pci_dev *pdev);
> pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> pci_channel_state_t state);
> +int port_err_nb_cb(struct notifier_block *unused,
> + unsigned long event, void *ptr);
> #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..f4183c5aea38 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -926,6 +926,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> return rc;
> }
>
> +struct notifier_block port_internal_err_nb = {
> + .notifier_call = port_internal_err_cb,
> +};
> +
> static const struct pci_device_id cxl_mem_pci_tbl[] = {
> /* PCI class code for CXL.mem Type-3 Devices */
> { PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)},
> @@ -974,6 +978,19 @@ static struct pci_driver cxl_pci_driver = {
> },
> };
>
> -module_pci_driver(cxl_pci_driver);
> +static int __init cxl_pci_init(void)
> +{
> + atomic_notifier_chain_register(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
Long line that you can easily break.
> + return pci_register_driver(&cxl_pci_driver);
> +}
> +module_init(cxl_pci_init);
> +
> +static void __exit cxl_pci_exit(void)
> +{
> + atomic_notifier_chain_unregister(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
Long line that you can easily break.
> + pci_unregister_driver(&cxl_pci_driver);
> +}
> +module_exit(cxl_pci_exit);
> +
> MODULE_LICENSE("GPL v2");
> MODULE_IMPORT_NS(CXL);
next prev parent reply other threads:[~2024-06-20 13:09 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-20 11:21 ` Jonathan Cameron
2024-06-24 14:58 ` Terry Bowman
2024-06-21 19:17 ` Dan Williams
2024-06-24 17:56 ` Terry Bowman
2024-07-10 20:48 ` nifan.cxl
2024-07-10 21:48 ` Terry Bowman
2024-07-11 1:14 ` fan
2024-08-19 18:35 ` Fan Ni
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
2024-06-20 11:31 ` Jonathan Cameron
2024-06-24 15:08 ` Terry Bowman
2024-06-21 19:23 ` Dan Williams
2024-06-24 18:00 ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30 ` Jonathan Cameron
2024-06-24 15:22 ` Terry Bowman
2024-06-21 19:36 ` Dan Williams
2024-06-24 18:21 ` Terry Bowman
2024-06-24 21:46 ` Dan Williams
2024-06-25 14:41 ` Terry Bowman
2024-06-26 2:54 ` Li, Ming4
2024-06-26 13:39 ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
2024-06-20 12:46 ` Jonathan Cameron
2024-06-24 15:51 ` Terry Bowman
2024-07-02 15:18 ` Jonathan Cameron
2024-06-26 3:39 ` Li, Ming4
2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
2024-06-20 12:49 ` Jonathan Cameron
2024-07-15 17:50 ` nifan.cxl
2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
2024-06-20 12:53 ` Jonathan Cameron
2024-06-24 15:53 ` Terry Bowman
2024-07-02 15:53 ` Jonathan Cameron
2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
2024-06-20 13:09 ` Jonathan Cameron [this message]
2024-06-24 16:09 ` Terry Bowman
2024-07-02 15:58 ` Jonathan Cameron
2024-06-26 6:22 ` Li, Ming4
2024-06-26 13:51 ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19 7:09 ` Christoph Hellwig
2024-06-19 15:40 ` Terry Bowman
2024-06-20 13:11 ` Jonathan Cameron
2024-06-24 16:22 ` Terry Bowman
2024-07-10 21:47 ` Bjorn Helgaas
2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
2024-06-20 13:15 ` Jonathan Cameron
2024-06-24 16:46 ` Terry Bowman
2024-07-02 16:00 ` Jonathan Cameron
2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
2024-06-24 17:47 ` Terry Bowman
2024-06-24 20:51 ` Dan Williams
2024-06-25 14:29 ` Terry Bowman
2024-07-25 18:49 ` fan
2024-08-19 16:21 ` Terry Bowman
2024-08-19 18:17 ` Fan Ni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240620140926.000029d2@Huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=Robert.Richter@amd.com \
--cc=Yazen.Ghannam@amd.com \
--cc=alison.schofield@intel.com \
--cc=ardb@kernel.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=ira.weiny@intel.com \
--cc=jim.harris@samsung.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=ming4.li@intel.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=terry.bowman@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.