* [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-22 1:53 ` Dan Williams
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
` (17 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
CXL protocol errors are reported to the OS through PCIe correctable and
uncorrectable internal errors. However, since CXL PCIe port devices
are currently bound to the portdrv driver, there is no mechanism to
notify the CXL driver, which is necessary for proper logging and
handling.
To address this, introduce CXL PCIe port error callbacks along with
register/unregister and accessor functions. The callbacks will be
invoked by the AER driver when protocol errors are reported by
a CXL port device.
The AER driver callbacks will be used in future patches implementing
CXL PCIe port error handling.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
include/linux/aer.h | 14 ++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 13b8586924ea..a9792b9576b4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -50,6 +50,8 @@ struct aer_rpc {
DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
};
+static struct cxl_port_err_hndlrs cxl_port_hndlrs;
+
/* AER stats for the device */
struct aer_stats {
@@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
struct aer_err_info *info) { }
#endif
+void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
+{
+ cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
+ cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
+}
+EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
+
+void unregister_cxl_port_hndlrs(void)
+{
+ cxl_port_hndlrs.error_detected = NULL;
+ cxl_port_hndlrs.cor_error_detected = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
+
+struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
+{
+ return &cxl_port_hndlrs;
+}
+EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
+
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..67fd04c5ae2b 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
#include <linux/errno.h>
#include <linux/types.h>
+#include <linux/pci.h>
#define AER_NONFATAL 0
#define AER_FATAL 1
@@ -55,5 +56,18 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
int cper_severity_to_aer(int cper_severity);
void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity, struct aer_capability_regs *aer_regs);
+
+struct cxl_port_err_hndlrs {
+
+ /* CXL uncorrectable error detected on this device */
+ pci_ers_result_t (*error_detected)(struct pci_dev *dev,
+ pci_channel_state_t error);
+
+ /* CXL corrected error detected on this device */
+ void (*cor_error_detected)(struct pci_dev *dev);
+};
+void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs);
+void unregister_cxl_port_hndlrs(void);
+struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void);
#endif //_AER_H_
--
2.34.1
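
A minimal userspace sketch of the registration interface added by this
patch, for readers who want to trace the flow outside the kernel. The
three register/unregister/find functions mirror the patch; the typedefs,
the reduced 'struct pci_dev', and the caller aer_report_cxl_cor_error()
with its demo callback are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; not the real definitions. */
struct pci_dev { int id; };
typedef int pci_ers_result_t;
typedef int pci_channel_state_t;

struct cxl_port_err_hndlrs {
	/* CXL uncorrectable error detected on this device */
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   pci_channel_state_t error);
	/* CXL corrected error detected on this device */
	void (*cor_error_detected)(struct pci_dev *dev);
};

/* Single global handler set, as in the patch. */
static struct cxl_port_err_hndlrs cxl_port_hndlrs;

void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *h)
{
	assert(h != NULL);	/* sketch-only sanity check */
	cxl_port_hndlrs.error_detected = h->error_detected;
	cxl_port_hndlrs.cor_error_detected = h->cor_error_detected;
}

void unregister_cxl_port_hndlrs(void)
{
	cxl_port_hndlrs.error_detected = NULL;
	cxl_port_hndlrs.cor_error_detected = NULL;
}

struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
{
	return &cxl_port_hndlrs;
}

/* Hypothetical callback a CXL driver might register. */
int cor_count;
void demo_cor_error(struct pci_dev *dev)
{
	(void)dev;
	cor_count++;
}

/*
 * Hypothetical AER-side caller: invoke the corrected-error callback
 * if one is registered. Returns 1 if a callback ran, 0 otherwise.
 */
int aer_report_cxl_cor_error(struct pci_dev *dev)
{
	struct cxl_port_err_hndlrs *h = find_cxl_port_hndlrs();

	if (!h->cor_error_detected)
		return 0;	/* no CXL driver registered */
	h->cor_error_detected(dev);
	return 1;
}
```

The sketch makes one property of the design visible: there is a single
global handler set, so a second registration silently overwrites the
first, and callers must check each pointer for NULL before use.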
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-08 22:16 ` [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver Terry Bowman
@ 2024-10-22 1:53 ` Dan Williams
2024-10-22 13:50 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2024-10-22 1:53 UTC (permalink / raw)
To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Terry Bowman wrote:
> CXL protocol errors are reported to the OS through PCIe correctable and
> uncorrectable internal errors. However, since CXL PCIe port devices
> are currently bound to the portdrv driver, there is no mechanism to
> notify the CXL driver, which is necessary for proper logging and
> handling.
>
> To address this, introduce CXL PCIe port error callbacks along with
> register/unregister and accessor functions. The callbacks will be
> invoked by the AER driver in the case protocol errors are reported by
> a CXL port device.
>
> The AER driver callbacks will be used in future patches implementing
> CXL PCIe port error handling.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
> include/linux/aer.h | 14 ++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 13b8586924ea..a9792b9576b4 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -50,6 +50,8 @@ struct aer_rpc {
> DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
> };
>
> +static struct cxl_port_err_hndlrs cxl_port_hndlrs;
I think this can afford to splurge on a few more letters and make this
static struct cxl_port_error_handlers cxl_port_error_handlers;
> +
> /* AER stats for the device */
> struct aer_stats {
>
> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
> struct aer_err_info *info) { }
> #endif
>
> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
> +{
> + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
> + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
> +}
> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
> +
> +void unregister_cxl_port_hndlrs(void)
> +{
> + cxl_port_hndlrs.error_detected = NULL;
> + cxl_port_hndlrs.cor_error_detected = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
> +
> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
> +{
> + return &cxl_port_hndlrs;
> +}
> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
I guess I will need to go deeper into the code, but I would not have
expected that new registration interfaces are needed. Each 'struct
pci_driver' could optionally include CXL error handlers alongside their
PCIe error handlers and when CXL AER errors are broadcast only the CXL
handlers are invoked. I.e. the registration is something like:
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 6af5e0425872..42db26195bda 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
.shutdown = pcie_portdrv_shutdown,
.err_handler = &pcie_portdrv_err_handler,
+ .cxl_err_handler = &cxl_portdrv_err_handler,
.driver_managed_dma = true,
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-22 1:53 ` Dan Williams
@ 2024-10-22 13:50 ` Terry Bowman
2024-10-22 17:09 ` Dan Williams
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-22 13:50 UTC (permalink / raw)
To: Dan Williams, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
Hi Dan,
On 10/21/24 20:53, Dan Williams wrote:
> Terry Bowman wrote:
>> CXL protocol errors are reported to the OS through PCIe correctable and
>> uncorrectable internal errors. However, since CXL PCIe port devices
>> are currently bound to the portdrv driver, there is no mechanism to
>> notify the CXL driver, which is necessary for proper logging and
>> handling.
>>
>> To address this, introduce CXL PCIe port error callbacks along with
>> register/unregister and accessor functions. The callbacks will be
>> invoked by the AER driver in the case protocol errors are reported by
>> a CXL port device.
>>
>> The AER driver callbacks will be used in future patches implementing
>> CXL PCIe port error handling.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
>> include/linux/aer.h | 14 ++++++++++++++
>> 2 files changed, 36 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 13b8586924ea..a9792b9576b4 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -50,6 +50,8 @@ struct aer_rpc {
>> DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>> };
>>
>> +static struct cxl_port_err_hndlrs cxl_port_hndlrs;
>
> I think this can afford to splurge on a few more letters and make this
>
> static struct cxl_port_error_handlers cxl_port_error_handlers;
>
>
Ok.
>> +
>> /* AER stats for the device */
>> struct aer_stats {
>>
>> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
>> struct aer_err_info *info) { }
>> #endif
>>
>> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
>> +{
>> + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
>> + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
>> +
>> +void unregister_cxl_port_hndlrs(void)
>> +{
>> + cxl_port_hndlrs.error_detected = NULL;
>> + cxl_port_hndlrs.cor_error_detected = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
>> +
>> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
>> +{
>> + return &cxl_port_hndlrs;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
>
> I guess I will need to go deeper into the code, but I would not have
> expected that new registration interfaces are needed. Each 'struct
> pci_driver' could optionally include CXL error handlers alongside their
> PCIe error handlers and when CXL AER errors are broadcast only the CXL
> handlers are invoked. I.e. the registration is something like:
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 6af5e0425872..42db26195bda 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
> .shutdown = pcie_portdrv_shutdown,
>
> .err_handler = &pcie_portdrv_err_handler,
> + .cxl_err_handler = &cxl_portdrv_err_handler,
>
> .driver_managed_dma = true,
Ok. I'm thinking of adding a definition for 'pci_dev::cxl_err_handler' of type
'struct pci_error_handlers'.
'struct pci_error_handlers' contains slot_reset(), resume(), and mmio_enabled()
function pointers that are used in PCIe recovery if available. The plan is for CXL
devices to call panic for UCE fatal and non-fatal, but it might be good to use the
'struct pci_error_handlers' type in case the other handlers are needed in
the future. It also makes the logic to access and use the error handlers common,
requiring less code.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-22 13:50 ` Terry Bowman
@ 2024-10-22 17:09 ` Dan Williams
2024-10-22 18:40 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2024-10-22 17:09 UTC (permalink / raw)
To: Terry Bowman, Dan Williams, ming4.li, linux-cxl, linux-kernel,
linux-pci, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Terry Bowman wrote:
[..]
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 6af5e0425872..42db26195bda 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
> > .shutdown = pcie_portdrv_shutdown,
> >
> > .err_handler = &pcie_portdrv_err_handler,
> > + .cxl_err_handler = &cxl_portdrv_err_handler,
> >
> > .driver_managed_dma = true,
>
> Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type
> 'struct pci_error_handler'.
>
> 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn
> pointers that are used in PCIe recovery if available. The plan is for CXL devices to
> call panic for UCE fatal and non-fatal but it might be good to use the
> 'struct pci_error_handler' type in case there are needs for the other handlers in
> the future. It also makes the logic to access and use the error handlers common,
> requiring less code.
Can you give an example where CXL can reuse 'struct pci_error_handlers'
infrastructure? The PCI error handlers are built around the idea that
operations can be paused and recovered, CXL operations assume near
constant device participation in CPU cache and memory coherency
protocol.
About the only reuse I can think of is cases where a CXL error could be
sent down the PCI error handler path, i.e. ones that would send a
'pci_channel_io_normal' notice to ->error_detected(). Otherwise,
pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL
error handling.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-22 17:09 ` Dan Williams
@ 2024-10-22 18:40 ` Terry Bowman
2024-10-22 23:43 ` Dan Williams
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-22 18:40 UTC (permalink / raw)
To: Dan Williams, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
Hi Dan,
On 10/22/24 12:09, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>>> index 6af5e0425872..42db26195bda 100644
>>> --- a/drivers/pci/pcie/portdrv.c
>>> +++ b/drivers/pci/pcie/portdrv.c
>>> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
>>> .shutdown = pcie_portdrv_shutdown,
>>>
>>> .err_handler = &pcie_portdrv_err_handler,
>>> + .cxl_err_handler = &cxl_portdrv_err_handler,
>>>
>>> .driver_managed_dma = true,
>>
>> Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type
>> 'struct pci_error_handler'.
>>
>> 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn
>> pointers that are used in PCIe recovery if available. The plan is for CXL devices to
>> call panic for UCE fatal and non-fatal but it might be good to use the
>> 'struct pci_error_handler' type in case there are needs for the other handlers in
>> the future. It also makes the logic to access and use the error handlers common,
>> requiring less code.
>
> Can you give an example where CXL can reuse 'struct pci_error_handlers`
> infrastructure? The PCI error handlers are built around the idea that
> operations can be paused and recovered, CXL operations assume near
> constant device participation in CPU cache and memory coherency
> protocol.
>
> About the only reuse I can think of is cases where a CXL error could be
> sent down the PCI error handler path, i.e. ones that would send a
> 'pci_channel_io_normal' notice to ->error_detected(). Otherwise,
> pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL
> error handling.
I was referring to reusing a separate instance of 'struct pci_error_handlers' for CXL
UCE/CE errors.
One example where it can be reused in infrastructure is in err.c's
report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers'
then the updated report_error_detected() becomes a bit simpler with less helper
function logic. But that is not a reason by itself to choose to reuse 'struct
pci_error_handlers' for CXL errors.
Looking closer at aer.c shows there is no advantage in this file to using 'struct
pci_error_handlers' for CXL errors.
If I understand correctly, you want a new type introduced, 'struct cxl_error_handlers',
containing 2 function pointers for CE and UCE handling.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-22 18:40 ` Terry Bowman
@ 2024-10-22 23:43 ` Dan Williams
2024-10-24 15:20 ` Bowman, Terry
0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2024-10-22 23:43 UTC (permalink / raw)
To: Terry Bowman, Dan Williams, ming4.li, linux-cxl, linux-kernel,
linux-pci, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Terry Bowman wrote:
[..]
> I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL
> UCE-CE errors.
>
> One example where it can be reused in infrastructure is in err.c's
> report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers'
> then the updated report_error_detected() becomes a bit simpler with less helper
> function logic.
report_error_detected() is concerned with link and i/o state
(pci_dev_is_disconnected() and pci_dev_set_io_state()). For device
disconnects, CXL recovery potentially needs to span multiple devices.
For i/o state, CXL.io could be fully operational while CXL.cache and
CXL.mem are in fatal state.
CXL considerations do not feel welcome in that function.
Ideally a PCIe developer never needs to see or understand the CXL error
model because it is off in its own path. In other words, if someone
maintaining pcie_do_recovery=>report_error_detected() for the PCIe case
needs to go find a CXL expert each time they want to touch that path,
that feels like a regression in PCIe error handling maintainability.
> But, it's not a reason by itself to choose to reuse 'struct
> pci_error_handlers' for CXL errors.
>
> Looking closer at aer,c shows there is no advantage in this file for using 'struct
> pci_error_handlers' for CXL errors.
>
> If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'.
Yes, mainly because the bus state and the result of the recovery tend to
be a different operational model. If a CXL error fits the PCIe model
then it can be sent via pcie_do_recovery(), but I expect that only
applies to a handful of correctable errors like CRC_Threshold,
Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems*
like it has a CXL specific response that would confuse
pcie_do_recovery().
So, in general new operational models == new data structures and types.
> And will contain 2 function pointers for CE and UCE handling.
Unless and until we define a CXL Reset flow, then yes, I assume you
mean ->error_detected() and ->cor_error_detected()?
I do think there will be some limited fatal cases with CXL accelerators
that could be recoverable if the accelerator knows that the memory error
can be recovered by resetting the device without surprise data-loss.
That work can wait until those failure cases become clearer.
I imagine something like CXL error isolation for a host-bridge dedicated
to a single accelerator might be recoverable, but anything with
general-purpose memory is likely better off with a kernel-panic (see the
CXL error isolation discussion:
http://lore.kernel.org/e7d4a31a-bd5e-41d9-9b51-fbbd5e8fc9b2@amd.com).
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-22 23:43 ` Dan Williams
@ 2024-10-24 15:20 ` Bowman, Terry
2024-10-24 19:10 ` Dan Williams
0 siblings, 1 reply; 62+ messages in thread
From: Bowman, Terry @ 2024-10-24 15:20 UTC (permalink / raw)
To: Dan Williams, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
Hi Dan,
I added a question below.
On 10/22/2024 6:43 PM, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL
>> UCE-CE errors.
>>
>> One example where it can be reused in infrastructure is in err.c's
>> report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers'
>> then the updated report_error_detected() becomes a bit simpler with less helper
>> function logic.
> report_error_detected() is concerned with link and i/o state
> (pci_dev_is_disconnected() and pci_dev_set_io_state()). For device
> disconnects, CXL recovery potentially needs to span multiple devices.
> For i/o state, CXL.io could be fully operational while CXL.cache and
> CXL.mem are in fatal state.
>
> CXL considerations do not feel welcome in that function.
>
> Ideally a PCIe developer never needs to see or understand the CXL error
> model because it is off in its own path. In other words, if someone
> maintaining pcie_do_recovery=>report_error_detected() for the PCIe case
> needs to go find a CXL expert each time they want to touch that path,
> that feels like a regression in PCIe error handling maintainability.
>
>> But, it's not a reason by itself to choose to reuse 'struct
>> pci_error_handlers' for CXL errors.
>>
>> Looking closer at aer,c shows there is no advantage in this file for using 'struct
>> pci_error_handlers' for CXL errors.
>>
>> If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'.
> Yes, mainly because the bus state and the result of the recovery tend to
> be a different operational model. If a CXL error fits the PCIe model
> then it can be sent via pcie_do_recovery(), but I expect that only
> applies to a handful of correctable errors like CRC_Threshold,
> Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems*
> like it has a CXL specific response that would confuse
> pcie_do_recovery().
>
> So, in general new operational models == new data structures and types.
Would you like to continue to use the pci_error_handlers for the CXL PCIe
endpoint device driver? Or do we change the CXL PCIe endpoint driver to
use the cxl_error_handlers ?
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
2024-10-24 15:20 ` Bowman, Terry
@ 2024-10-24 19:10 ` Dan Williams
0 siblings, 0 replies; 62+ messages in thread
From: Dan Williams @ 2024-10-24 19:10 UTC (permalink / raw)
To: Bowman, Terry, Dan Williams, ming4.li, linux-cxl, linux-kernel,
linux-pci, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Bowman, Terry wrote:
[..]
> > So, in general new operational models == new data structures and types.
>
> Would you like to continue to use the pci_error_handlers for the CXL PCIe
> endpoint device driver? Or do we change the CXL PCIe endpoint driver to
> use the cxl_error_handlers ?
I would expect to extend with new 'cxl_error_handlers' support. 'Extend'
not 'replace' because a CXL endpoint could in fact be operating in pure
PCIe mode.
Otherwise, it is currently ambiguous what an error handler invocation is
signaling. In the new scheme fatal CXL errors are panics, while fatal
PCIe errors remain device resets.
So the question is how to distinguish those events without separate
entry points. Either the error (bus) type needs to be added as a
parameter to the callbacks or the error type is implied by the 'error
handlers' type. I would rather not go disturb the PCIe error handler
world by making them deal with an error type passed to existing
callbacks, so a new invocation regime seems appropriate.
I am open to other thoughts, but this seems the most maintainable way to
handle these parallel universes that both happen to fork from an AER
error source.
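
The "error type is implied by the 'error handlers' type" option can be
sketched in userspace C as follows. This is only an illustration under
assumptions: the struct layouts are reduced stand-ins, the
'cxl_err_handler' member and 'struct cxl_error_handlers' are the names
proposed in this thread rather than existing kernel API, and
dispatch_uncorrectable() plus the demo callbacks are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct pci_dev;

/* Reduced stand-in for the existing PCIe handler set. */
struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev);
};

/* Proposed (hypothetical) CXL handler set with its own entry points. */
struct cxl_error_handlers {
	int (*error_detected)(struct pci_dev *dev);
	void (*cor_error_detected)(struct pci_dev *dev);
};

/* Reduced stand-in: a driver may provide either set, both, or none. */
struct pci_driver {
	const struct pci_error_handlers *err_handler;
	const struct cxl_error_handlers *cxl_err_handler;
};

struct pci_dev {
	const struct pci_driver *driver;
};

enum { HANDLED_NONE, HANDLED_PCIE, HANDLED_CXL };

/* Demo callbacks that only count invocations. */
int pcie_calls, cxl_calls;
int demo_pcie_detected(struct pci_dev *dev) { (void)dev; return ++pcie_calls; }
int demo_cxl_detected(struct pci_dev *dev) { (void)dev; return ++cxl_calls; }

/*
 * Route an uncorrectable error to exactly one handler set. The PCIe
 * callback signature is untouched; the bus type is implied by which
 * set is invoked, so a CXL error never reaches the PCIe callbacks.
 */
int dispatch_uncorrectable(struct pci_dev *dev, int is_cxl_error)
{
	const struct pci_driver *drv;

	assert(dev != NULL);	/* sketch-only sanity check */
	drv = dev->driver;
	if (!drv)
		return HANDLED_NONE;

	if (is_cxl_error) {
		if (drv->cxl_err_handler && drv->cxl_err_handler->error_detected) {
			drv->cxl_err_handler->error_detected(dev);
			return HANDLED_CXL;
		}
		return HANDLED_NONE;
	}

	if (drv->err_handler && drv->err_handler->error_detected) {
		drv->err_handler->error_detected(dev);
		return HANDLED_PCIE;
	}
	return HANDLED_NONE;
}
```

In this shape a driver for an endpoint that may run in pure PCIe mode
fills in both members, which matches the 'extend' rather than 'replace'
point above.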
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-08 22:16 ` [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 16:11 ` Jonathan Cameron
2024-10-22 2:17 ` Dan Williams
2024-10-08 22:16 ` [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports Terry Bowman
` (16 subsequent siblings)
18 siblings, 2 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
CXL port error handling will be updated in the future and will use
logic to determine whether an error requires CXL or PCIe processing.
Internal errors are one indicator that an error is a CXL
protocol error.
is_internal_error() is currently only available when the CONFIG_PCIEAER_CXL
kernel config is enabled.
Move the is_internal_error() definition so that it is
always available regardless of whether the CONFIG_PCIEAER_CXL kernel config
is enabled or disabled.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a9792b9576b4..1e72829a249f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}
-#ifdef CONFIG_PCIEAER_CXL
+static bool is_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+#ifdef CONFIG_PCIEAER_CXL
/**
* pci_aer_unmask_internal_errors - unmask internal errors
* @dev: pointer to the pcie_dev data structure
@@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
return (pcie_ports_native || host->native_aer);
}
-static bool is_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-
static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
{
struct aer_err_info *info = (struct aer_err_info *)data;
--
2.34.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
@ 2024-10-16 16:11 ` Jonathan Cameron
2024-10-22 2:17 ` Dan Williams
1 sibling, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 16:11 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:44 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL port error handling will be updated in future and will use
> logic to determine if an error requires CXL or PCIe processing.
> Internal errors are one indicator to identify an error is a CXL
> protocol error.
>
> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL
> kernel config.
>
> Update the is_internal_error() function's declaration such that it is
> always available regardless if CONFIG_PCIEAER_CXL kernel config
> is enabled or disabled.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Given this has nothing specifically to do with CXL, this seems
sensible to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
2024-10-16 16:11 ` Jonathan Cameron
@ 2024-10-22 2:17 ` Dan Williams
2024-10-22 13:54 ` Terry Bowman
1 sibling, 1 reply; 62+ messages in thread
From: Dan Williams @ 2024-10-22 2:17 UTC (permalink / raw)
To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Terry Bowman wrote:
> CXL port error handling will be updated in future and will use
> logic to determine if an error requires CXL or PCIe processing.
> Internal errors are one indicator to identify an error is a CXL
> protocol error.
I expect it would be better to fold this into the patch that makes use of
the is_internal_error() outside of the CONFIG_PCIEAER_CXL case.
With this patch in isolation it is not clear that a kernel that sets
CONFIG_PCIEAER_CXL=n should distinguish PCIe internal errors from CXL
errors.
The real problem seems to be that CONFIG_PCIEAER_CXL depends on CXL_PCI.
I.e. is_internal_error() only matters for the CXL case, and the CXL
handling is moving more into the core and dropping its CXL_PCI
dependency.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL
2024-10-22 2:17 ` Dan Williams
@ 2024-10-22 13:54 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-22 13:54 UTC (permalink / raw)
To: Dan Williams, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
Hi Dan,
On 10/21/24 21:17, Dan Williams wrote:
> Terry Bowman wrote:
>> CXL port error handling will be updated in future and will use
>> logic to determine if an error requires CXL or PCIe processing.
>> Internal errors are one indicator to identify an error is a CXL
>> protocol error.
>
> I expect it would better to fold this into the patch that makes use of
> the is_internal_error() outside of the CONFIG_PCIEAER_CXL case.
>
> With this patch in isolation it is not clear that a kernel that sets
> CONFIG_PCIEAER_CXL=n should distinguish PCIe internal errors from CXL
> errors.
>
> The real problem seems to be that CONFIG_PCIEAER_CXL depends on CXL_PCI.
> I.e. is_internal_error() only matters for the CXL case, and the CXL
> handling is moving more into the core and dropping its CXL_PCI
> dependency.
I will merge the patch as you described.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-08 22:16 ` [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver Terry Bowman
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-10 19:11 ` Bjorn Helgaas
2024-10-08 22:16 ` [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
` (15 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The AER service driver already includes support for CXL restricted host
(RCH) downstream port error handling. The current implementation is based
on CXL1.1 and uses a root complex event collector.
Update the function interfaces and parameters where necessary to add
virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
handling. The CXL PCIe port error handling will be added in a future patch.
Limit changes to refactoring variable and function names. No
functional changes are added.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1e72829a249f..dc8b17999001 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1030,7 +1030,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
return 0;
}
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
{
/*
* Internal errors of an RCEC indicate an AER error in an
@@ -1053,30 +1053,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
return *handles_cxl;
}
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
{
bool handles_cxl = false;
- if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
- pcie_aer_is_native(rcec))
- pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+ pcie_aer_is_native(dev))
+ pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
return handles_cxl;
}
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
+static void cxl_enable_internal_errors(struct pci_dev *dev)
{
- if (!handles_cxl_errors(rcec))
+ if (!handles_cxl_errors(dev))
return;
- pci_aer_unmask_internal_errors(rcec);
- pci_info(rcec, "CXL: Internal errors unmasked");
+ pci_aer_unmask_internal_errors(dev);
+ pci_info(dev, "CXL: Internal errors unmasked");
}
#else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
- struct aer_err_info *info) { }
+static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
+static inline void cxl_handle_error(struct pci_dev *dev,
+ struct aer_err_info *info) { }
#endif
void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
@@ -1134,7 +1134,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
- cxl_rch_handle_error(dev, info);
+ cxl_handle_error(dev, info);
pci_aer_handle_error(dev, info);
pci_dev_put(dev);
}
@@ -1512,7 +1512,7 @@ static int aer_probe(struct pcie_device *dev)
return status;
}
- cxl_rch_enable_rcec(port);
+ cxl_enable_internal_errors(port);
aer_enable_rootport(rpc);
pci_info(port, "enabled with IRQ %d\n", dev->irq);
return 0;
--
2.34.1
* Re: [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports
2024-10-08 22:16 ` [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports Terry Bowman
@ 2024-10-10 19:11 ` Bjorn Helgaas
2024-10-14 17:27 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2024-10-10 19:11 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
I would describe this more as "renaming" than "refactoring".
On Tue, Oct 08, 2024 at 05:16:45PM -0500, Terry Bowman wrote:
> The AER service driver already includes support for CXL restricted host
> (RCH) downstream port error handling. The current implementation is based
> on CXL1.1 and uses a root complex event collector.
>
> Update the function interfaces and parameters where necessary to add
> virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
> handling. The CXL PCIe port error handling will be added in a future patch.
"Virtual Hierarchy mode" sounds like something defined by the spec.
If so, add a citation and capitalize it the same way it's used in the
spec.
Same for "restricted host", at least in terms of styling. That
support was added previously, so a citation probably isn't necessary
here, but since this is part of *adding* VH support, hints about VH
will be more helpful.
> Limit changes to refactoring variable and function names. No
> functional changes are added.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1e72829a249f..dc8b17999001 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1030,7 +1030,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> return 0;
> }
>
> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> {
> /*
> * Internal errors of an RCEC indicate an AER error in an
> @@ -1053,30 +1053,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> return *handles_cxl;
> }
>
> -static bool handles_cxl_errors(struct pci_dev *rcec)
> +static bool handles_cxl_errors(struct pci_dev *dev)
> {
> bool handles_cxl = false;
>
> - if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
> - pcie_aer_is_native(rcec))
> - pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> + pcie_aer_is_native(dev))
> + pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>
> return handles_cxl;
> }
>
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +static void cxl_enable_internal_errors(struct pci_dev *dev)
> {
> - if (!handles_cxl_errors(rcec))
> + if (!handles_cxl_errors(dev))
> return;
>
> - pci_aer_unmask_internal_errors(rcec);
> - pci_info(rcec, "CXL: Internal errors unmasked");
> + pci_aer_unmask_internal_errors(dev);
> + pci_info(dev, "CXL: Internal errors unmasked");
> }
>
> #else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
> - struct aer_err_info *info) { }
> +static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
> +static inline void cxl_handle_error(struct pci_dev *dev,
> + struct aer_err_info *info) { }
> #endif
>
> void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
> @@ -1134,7 +1134,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>
> static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> {
> - cxl_rch_handle_error(dev, info);
> + cxl_handle_error(dev, info);
> pci_aer_handle_error(dev, info);
> pci_dev_put(dev);
> }
> @@ -1512,7 +1512,7 @@ static int aer_probe(struct pcie_device *dev)
> return status;
> }
>
> - cxl_rch_enable_rcec(port);
> + cxl_enable_internal_errors(port);
> aer_enable_rootport(rpc);
> pci_info(port, "enabled with IRQ %d\n", dev->irq);
> return 0;
> --
> 2.34.1
>
* Re: [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports
2024-10-10 19:11 ` Bjorn Helgaas
@ 2024-10-14 17:27 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-14 17:27 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Hi Bjorn,
On 10/10/24 14:11, Bjorn Helgaas wrote:
> I would describe this more as "renaming" than "refactoring".
>
Good point. Renaming is more correct. Thanks.
> On Tue, Oct 08, 2024 at 05:16:45PM -0500, Terry Bowman wrote:
>> The AER service driver already includes support for CXL restricted host
>> (RCH) downstream port error handling. The current implementation is based
>> on CXL1.1 and uses a root complex event collector.
>>
>> Update the function interfaces and parameters where necessary to add
>> virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
>> handling. The CXL PCIe port error handling will be added in a future patch.
>
> "Virtual Hierarchy mode" sounds like something defined by the spec.
> If so, add a citation and capitalize it the same way it's used in the
> spec.
>
> Same for "restricted host", at least in terms of styling. That
> support was added previously, so a citation probably isn't necessary
> here, but since this is part of *adding* VH support, hints about VH
> will be more helpful.
>
Ok.
Regards,
Terry
>> Limit changes to refactoring variable and function names. No
>> functional changes are added.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
>> 1 file changed, 14 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 1e72829a249f..dc8b17999001 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1030,7 +1030,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> return 0;
>> }
>>
>> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> {
>> /*
>> * Internal errors of an RCEC indicate an AER error in an
>> @@ -1053,30 +1053,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>> return *handles_cxl;
>> }
>>
>> -static bool handles_cxl_errors(struct pci_dev *rcec)
>> +static bool handles_cxl_errors(struct pci_dev *dev)
>> {
>> bool handles_cxl = false;
>>
>> - if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
>> - pcie_aer_is_native(rcec))
>> - pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
>> + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> + pcie_aer_is_native(dev))
>> + pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>
>> return handles_cxl;
>> }
>>
>> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>> +static void cxl_enable_internal_errors(struct pci_dev *dev)
>> {
>> - if (!handles_cxl_errors(rcec))
>> + if (!handles_cxl_errors(dev))
>> return;
>>
>> - pci_aer_unmask_internal_errors(rcec);
>> - pci_info(rcec, "CXL: Internal errors unmasked");
>> + pci_aer_unmask_internal_errors(dev);
>> + pci_info(dev, "CXL: Internal errors unmasked");
>> }
>>
>> #else
>> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
>> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
>> - struct aer_err_info *info) { }
>> +static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>> +static inline void cxl_handle_error(struct pci_dev *dev,
>> + struct aer_err_info *info) { }
>> #endif
>>
>> void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
>> @@ -1134,7 +1134,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>
>> static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>> {
>> - cxl_rch_handle_error(dev, info);
>> + cxl_handle_error(dev, info);
>> pci_aer_handle_error(dev, info);
>> pci_dev_put(dev);
>> }
>> @@ -1512,7 +1512,7 @@ static int aer_probe(struct pcie_device *dev)
>> return status;
>> }
>>
>> - cxl_rch_enable_rcec(port);
>> + cxl_enable_internal_errors(port);
>> aer_enable_rootport(rpc);
>> pci_info(port, "enabled with IRQ %d\n", dev->irq);
>> return 0;
>> --
>> 2.34.1
>>
* [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (2 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 16:22 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
` (14 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The AER service driver currently does not manage CXL PCIe port
protocol errors reported by CXL root ports, CXL upstream switch ports,
and CXL downstream switch ports. Consequently, RAS protocol errors
from CXL PCIe port devices are not properly logged or handled.
These errors are reported to the OS via the root port's AER correctable
and uncorrectable internal error fields. While the AER driver supports
handling downstream port protocol errors in restricted CXL host (RCH)
mode, also known as CXL1.1, it lacks the same functionality for CXL
PCIe ports operating in virtual hierarchy (VH) mode, introduced in
CXL2.0.
To address this gap, update the AER driver to handle CXL PCIe port
device protocol correctable errors (CE).
Uncorrectable error (UCE) handling will be added in a future
patch.
Make this update alongside the existing downstream port RCH error
handling logic, extending support to CXL PCIe ports in VH.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 54 +++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 11 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index dc8b17999001..1c996287d4ce 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -40,6 +40,8 @@
#define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
#define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
+#define CXL_DVSEC_PORT_EXTENSIONS 3
+
struct aer_err_source {
u32 status; /* PCI_ERR_ROOT_STATUS */
u32 id; /* PCI_ERR_ROOT_ERR_SRC */
@@ -941,6 +943,17 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}
+static bool is_pcie_cxl_port(struct pci_dev *dev)
+{
+ if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
+ return false;
+
+ return (!!pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+ CXL_DVSEC_PORT_EXTENSIONS));
+}
+
static bool is_internal_error(struct aer_err_info *info)
{
if (info->severity == AER_CORRECTABLE)
@@ -1032,14 +1045,22 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
{
- /*
- * Internal errors of an RCEC indicate an AER error in an
- * RCH's downstream port. Check and handle them in the CXL.mem
- * device driver.
- */
- if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
- is_internal_error(info))
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+
+ if (info->severity == AER_CORRECTABLE) {
+ struct cxl_port_err_hndlrs *cxl_port_hndlrs =
+ find_cxl_port_hndlrs();
+ int aer = dev->aer_cap;
+
+ if (aer)
+ pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+ info->status);
+
+ if (cxl_port_hndlrs && cxl_port_hndlrs->cor_error_detected)
+ cxl_port_hndlrs->cor_error_detected(dev);
+ pcie_clear_device_status(dev);
+ }
}
static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
@@ -1057,9 +1078,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
{
bool handles_cxl = false;
- if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
- pcie_aer_is_native(dev))
+ if (!pcie_aer_is_native(dev))
+ return false;
+
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+ else
+ handles_cxl = is_pcie_cxl_port(dev);
return handles_cxl;
}
@@ -1077,6 +1102,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
static inline void cxl_handle_error(struct pci_dev *dev,
struct aer_err_info *info) { }
+static bool handles_cxl_errors(struct pci_dev *dev)
+{
+ return false;
+}
#endif
void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
@@ -1134,8 +1163,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
- cxl_handle_error(dev, info);
- pci_aer_handle_error(dev, info);
+ if (is_internal_error(info) && handles_cxl_errors(dev))
+ cxl_handle_error(dev, info);
+ else
+ pci_aer_handle_error(dev, info);
+
pci_dev_put(dev);
}
--
2.34.1
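For readers following the thread, the dispatch change at the end of this
patch can be sketched outside the kernel. This is only an illustration under
stated assumptions: struct err_info, enum handler and pick_handler() are
hypothetical names that do not appear in the patch, and the dev_handles_cxl
argument stands in for the result of handles_cxl_errors(dev). The severity
and status-bit values are the ones from include/linux/aer.h and
include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>

/* Severity values as in include/linux/aer.h. */
#define AER_NONFATAL		0
#define AER_FATAL		1
#define AER_CORRECTABLE		2

/* Internal-error status bits as in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_INTN	0x00400000
#define PCI_ERR_COR_INTERNAL	0x00004000

/* Hypothetical, trimmed-down stand-in for struct aer_err_info. */
struct err_info {
	int severity;
	unsigned int status;
};

/* Hypothetical label for the handler that would be invoked. */
enum handler { HANDLER_CXL, HANDLER_PCIE };

/* Same test as is_internal_error(): pick the status bit by severity. */
static bool is_internal_error(const struct err_info *info)
{
	if (info->severity == AER_CORRECTABLE)
		return info->status & PCI_ERR_COR_INTERNAL;

	return info->status & PCI_ERR_UNC_INTN;
}

/*
 * Mirrors the new handle_error_source() logic: the CXL path is taken only
 * for internal errors on a device that handles CXL errors; everything else
 * goes to the regular PCIe AER path. The two paths are now exclusive.
 */
static enum handler pick_handler(const struct err_info *info,
				 bool dev_handles_cxl)
{
	if (is_internal_error(info) && dev_handles_cxl)
		return HANDLER_CXL;

	return HANDLER_PCIE;
}
```

Note the behavioural difference from the code being replaced: previously
cxl_handle_error() and pci_aer_handle_error() were both called for every
error source, whereas the new code calls exactly one of them.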
* Re: [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver
2024-10-08 22:16 ` [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
@ 2024-10-16 16:22 ` Jonathan Cameron
2024-10-16 17:18 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 16:22 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:46 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver currently does not manage CXL PCIe port
> protocol errors reported by CXL root ports, CXL upstream switch ports,
> and CXL downstream switch ports. Consequently, RAS protocol errors
> from CXL PCIe port devices are not properly logged or handled.
>
> These errors are reported to the OS via the root port's AER correctable
> and uncorrectable internal error fields. While the AER driver supports
> handling downstream port protocol errors in restricted CXL host (RCH)
> mode also known as CXL1.1, it lacks the same functionality for CXL
> PCIe ports operating in virtual hierarchy (VH) mode, introduced in
> CXL2.0.
>
> To address this gap, update the AER driver to handle CXL PCIe port
> device protocol correctable errors (CE).
>
> The uncorrectable error handling (UCE) will be added in a future
> patch.
>
> Make this update alongside the existing downstream port RCH error
> handling logic, extending support to CXL PCIe ports in VH.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Minor comments inline.
J
> ---
> drivers/pci/pcie/aer.c | 54 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 43 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index dc8b17999001..1c996287d4ce 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -40,6 +40,8 @@
> #define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
> #define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
>
> +#define CXL_DVSEC_PORT_EXTENSIONS 3
Duplicate of definition in drivers/cxl/cxlpci.h
Maybe wrap it up in an is_cxl_port() or similar? Or just
move that to a header both places can exercise.
> +
> struct aer_err_source {
> u32 status; /* PCI_ERR_ROOT_STATUS */
> u32 id; /* PCI_ERR_ROOT_ERR_SRC */
> @@ -941,6 +943,17 @@ static bool find_source_device(struct pci_dev *parent,
> return true;
> }
>
> +static bool is_pcie_cxl_port(struct pci_dev *dev)
> +{
> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> + return false;
> +
> + return (!!pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> + CXL_DVSEC_PORT_EXTENSIONS));
No need for the !!; the function returns the same result without that
clamping to 1/0, because any nonzero value converts to true.
> +}
> +
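The point about "!!" can be checked with a small user-space sketch. It
assumes a hypothetical struct fake_dev in place of struct pci_dev, with one
field holding what pci_pcie_type() would return and another holding the
offset that pci_find_dvsec_capability() would return (0 meaning the DVSEC is
absent). The port type values are those from include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Port type values as in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_TYPE_ENDPOINT	0x0
#define PCI_EXP_TYPE_ROOT_PORT	0x4
#define PCI_EXP_TYPE_UPSTREAM	0x5
#define PCI_EXP_TYPE_DOWNSTREAM	0x6

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct fake_dev {
	int pcie_type;		/* stands in for pci_pcie_type(dev) */
	uint16_t dvsec_off;	/* stands in for the DVSEC lookup; 0 = absent */
};

static bool is_pcie_cxl_port(const struct fake_dev *dev)
{
	switch (dev->pcie_type) {
	case PCI_EXP_TYPE_ROOT_PORT:
	case PCI_EXP_TYPE_UPSTREAM:
	case PCI_EXP_TYPE_DOWNSTREAM:
		break;
	default:
		return false;
	}

	/*
	 * Returning a 16-bit offset from a function declared bool converts
	 * it: zero becomes false, any nonzero value becomes true. This is a
	 * conversion, not a truncation, so an offset such as 0x100 is still
	 * true and no "!!" is needed.
	 */
	return dev->dvsec_off;
}
```

The switch is just another way of writing the three "!=" comparisons in the
patch; either form gives the same result.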
* Re: [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver
2024-10-16 16:22 ` Jonathan Cameron
@ 2024-10-16 17:18 ` Terry Bowman
2024-10-16 17:29 ` Jonathan Cameron
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-16 17:18 UTC (permalink / raw)
To: Jonathan Cameron
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/16/24 11:22, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:46 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver currently does not manage CXL PCIe port
>> protocol errors reported by CXL root ports, CXL upstream switch ports,
>> and CXL downstream switch ports. Consequently, RAS protocol errors
>> from CXL PCIe port devices are not properly logged or handled.
>>
>> These errors are reported to the OS via the root port's AER correctable
>> and uncorrectable internal error fields. While the AER driver supports
>> handling downstream port protocol errors in restricted CXL host (RCH)
>> mode also known as CXL1.1, it lacks the same functionality for CXL
>> PCIe ports operating in virtual hierarchy (VH) mode, introduced in
>> CXL2.0.
>>
>> To address this gap, update the AER driver to handle CXL PCIe port
>> device protocol correctable errors (CE).
>>
>> The uncorrectable error handling (UCE) will be added in a future
>> patch.
>>
>> Make this update alongside the existing downstream port RCH error
>> handling logic, extending support to CXL PCIe ports in VH.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Minor comments inline.
>
> J
>> ---
>> drivers/pci/pcie/aer.c | 54 +++++++++++++++++++++++++++++++++---------
>> 1 file changed, 43 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index dc8b17999001..1c996287d4ce 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -40,6 +40,8 @@
>> #define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
>> #define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
>>
>> +#define CXL_DVSEC_PORT_EXTENSIONS 3
>
> Duplicate of definition in drivers/cxl/cxlpci.h
>
> Maybe wrap it up in an is_cxl_port() or similar? Or just
> move that to a header both places can exercise.
>
>
Ok. I'll move the value '3' into the function call rather than use a #define.
>> +
>> struct aer_err_source {
>> u32 status; /* PCI_ERR_ROOT_STATUS */
>> u32 id; /* PCI_ERR_ROOT_ERR_SRC */
>> @@ -941,6 +943,17 @@ static bool find_source_device(struct pci_dev *parent,
>> return true;
>> }
>>
>> +static bool is_pcie_cxl_port(struct pci_dev *dev)
>> +{
>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>> + return false;
>> +
>> + return (!!pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>> + CXL_DVSEC_PORT_EXTENSIONS));
>
> No need for the !! it will return the same without that clamping to 1/0
> because any non 0 value is true.
>
Ok
Regards,
Terry
>> +}
>> +
* Re: [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver
2024-10-16 17:18 ` Terry Bowman
@ 2024-10-16 17:29 ` Jonathan Cameron
0 siblings, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 17:29 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Wed, 16 Oct 2024 12:18:06 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:
> Hi Jonathan,
>
> On 10/16/24 11:22, Jonathan Cameron wrote:
> > On Tue, 8 Oct 2024 17:16:46 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> The AER service driver currently does not manage CXL PCIe port
> >> protocol errors reported by CXL root ports, CXL upstream switch ports,
> >> and CXL downstream switch ports. Consequently, RAS protocol errors
> >> from CXL PCIe port devices are not properly logged or handled.
> >>
> >> These errors are reported to the OS via the root port's AER correctable
> >> and uncorrectable internal error fields. While the AER driver supports
> >> handling downstream port protocol errors in restricted CXL host (RCH)
> >> mode also known as CXL1.1, it lacks the same functionality for CXL
> >> PCIe ports operating in virtual hierarchy (VH) mode, introduced in
> >> CXL2.0.
> >>
> >> To address this gap, update the AER driver to handle CXL PCIe port
> >> device protocol correctable errors (CE).
> >>
> >> The uncorrectable error handling (UCE) will be added in a future
> >> patch.
> >>
> >> Make this update alongside the existing downstream port RCH error
> >> handling logic, extending support to CXL PCIe ports in VH.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Minor comments inline.
> >
> > J
> >> ---
> >> drivers/pci/pcie/aer.c | 54 +++++++++++++++++++++++++++++++++---------
> >> 1 file changed, 43 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index dc8b17999001..1c996287d4ce 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -40,6 +40,8 @@
> >> #define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
> >> #define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
> >>
> >> +#define CXL_DVSEC_PORT_EXTENSIONS 3
> >
> > Duplicate of definition in drivers/cxl/cxlpci.h
> >
> > Maybe wrap it up in an is_cxl_port() or similar? Or just
> > move that to a header both places can exercise.
> >
> >
>
> Ok. I'll move the value '3' into the function call rather than use a #define.
No, that's worse!
Find a way to have just one definition.
>
> >> +
> >> struct aer_err_source {
> >> u32 status; /* PCI_ERR_ROOT_STATUS */
> >> u32 id; /* PCI_ERR_ROOT_ERR_SRC */
> >> @@ -941,6 +943,17 @@ static bool find_source_device(struct pci_dev *parent,
> >> return true;
> >> }
> >>
> >> +static bool is_pcie_cxl_port(struct pci_dev *dev)
> >> +{
> >> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> >> + return false;
> >> +
> >> + return (!!pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> >> + CXL_DVSEC_PORT_EXTENSIONS));
> >
> > No need for the !! it will return the same without that clamping to 1/0
> > because any non 0 value is true.
> >
>
> Ok
>
> Regards,
> Terry
> >> +}
> >> +
* [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (3 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 16:28 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type Terry Bowman
` (13 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The AER service driver's aer_get_device_err_info() function does not
read uncorrectable (UCE) fatal error status from PCIe upstream port
devices. As a result, fatal errors are not logged or handled as needed
for CXL PCIe upstream switch port devices.
Update the aer_get_device_err_info() function to read the UCE fatal
status for all CXL PCIe port devices.
The fatal error status will be used in future patches implementing
CXL PCIe port error handling and logging.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1c996287d4ce..9b2872c8e20d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1282,6 +1282,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
type == PCI_EXP_TYPE_RC_EC ||
type == PCI_EXP_TYPE_DOWNSTREAM ||
+ type == PCI_EXP_TYPE_UPSTREAM ||
info->severity == AER_NONFATAL) {
/* Link is still healthy for IO reads */
--
2.34.1
* Re: [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices
2024-10-08 22:16 ` [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
@ 2024-10-16 16:28 ` Jonathan Cameron
0 siblings, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 16:28 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:47 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver's aer_get_device_err_info() function does not
> read uncorrectable (UCE) fatal error status from PCIe upstream port
> devices. As a result, fatal errors are not logged or handled as needed
> for CXL PCIe upstream switch port devices.
I wonder why not? Is this the first ever upstream port to
report an uncorrectable error (that didn't mean the link was
down), or is there something more subtle going on?
PCI folk, this one looks like it might cause problems to me.
>
> Update the aer_get_device_err_info() function to read the UCE fatal
error_info()
> status for all CXL PCIe port devices.
>
> The fatal error status will be used in future patches implementing
> CXL PCIe port error handling and logging.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/pci/pcie/aer.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1c996287d4ce..9b2872c8e20d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1282,6 +1282,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
> } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
> type == PCI_EXP_TYPE_RC_EC ||
> type == PCI_EXP_TYPE_DOWNSTREAM ||
> + type == PCI_EXP_TYPE_UPSTREAM ||
> info->severity == AER_NONFATAL) {
>
> /* Link is still healthy for IO reads */
So this comment makes me worried. In the general case, a fatal
error may mean we can't talk to the USP?
Jonathan
* [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (4 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 16:30 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
` (12 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The CXL AER service will be updated to support CXL PCIe port error
handling in the future. These devices will use a system panic during
recovery handling.
Add PCI_ERS_RESULT_PANIC enumeration to pci_ers_result type.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
include/linux/pci.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4cf89a4b4cbc..6f7e7371161d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -857,6 +857,9 @@ enum pci_ers_result {
/* No AER capabilities registered for the driver */
PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+ /* Device state requires system panic */
+ PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
};
/* PCI bus error event callbacks */
--
2.34.1
* Re: [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
2024-10-08 22:16 ` [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type Terry Bowman
@ 2024-10-16 16:30 ` Jonathan Cameron
2024-10-16 17:31 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 16:30 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:48 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL AER service will be updated to support CXL PCIe port error
> handling in the future. These devices will use a system panic during
> recovery handling.
Recovery handling by panic? :) That's an interesting form of recovery..
>
> Add PCI_ERS_RESULT_PANIC enumeration to pci_ers_result type.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> include/linux/pci.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 4cf89a4b4cbc..6f7e7371161d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -857,6 +857,9 @@ enum pci_ers_result {
>
> /* No AER capabilities registered for the driver */
> PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
> +
> + /* Device state requires system panic */
> + PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
> };
>
> /* PCI bus error event callbacks */
* Re: [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
2024-10-16 16:30 ` Jonathan Cameron
@ 2024-10-16 17:31 ` Terry Bowman
2024-10-17 13:31 ` Jonathan Cameron
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-16 17:31 UTC (permalink / raw)
To: Jonathan Cameron
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On 10/16/24 11:30, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:48 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL AER service will be updated to support CXL PCIe port error
>> handling in the future. These devices will use a system panic during
>> recovery handling.
>
> Recovery handling by panic? :) That's an interesting form of recovery..
>
Yes, Dan requested that all UCEs (fatal and non-fatal) be handled by panic in order
to limit the blast radius of corruption.
The recovery logic in cxl_do_recovery() (the path that does not panic) is also tested.
Regards,
Terry
>>
>> Add PCI_ERS_RESULT_PANIC enumeration to pci_ers_result type.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> include/linux/pci.h | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 4cf89a4b4cbc..6f7e7371161d 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -857,6 +857,9 @@ enum pci_ers_result {
>>
>> /* No AER capabilities registered for the driver */
>> PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>> +
>> + /* Device state requires system panic */
>> + PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
>> };
>>
>> /* PCI bus error event callbacks */
>
* Re: [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
2024-10-16 17:31 ` Terry Bowman
@ 2024-10-17 13:31 ` Jonathan Cameron
2024-10-17 14:50 ` Bowman, Terry
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 13:31 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Wed, 16 Oct 2024 12:31:35 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:
> On 10/16/24 11:30, Jonathan Cameron wrote:
> > On Tue, 8 Oct 2024 17:16:48 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> The CXL AER service will be updated to support CXL PCIe port error
> >> handling in the future. These devices will use a system panic during
> >> recovery handling.
> >
> > Recovery handling by panic? :) That's an interesting form of recovery..
> >
>
> Yes, Dan requested all UCE (fatal and non-fatal) are handled by panic in order
> to limit the blast radius of corruption in the case of UCE.
That's fair enough. Maybe it should be called attempted recovery handling ;)
This is fine.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Jonathan
>
> The recovery logic in cxl_do_recovery() (not using the panic) is also tested as well.
>
> Regards,
> Terry
>
> >>
> >> Add PCI_ERS_RESULT_PANIC enumeration to pci_ers_result type.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> >> include/linux/pci.h | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 4cf89a4b4cbc..6f7e7371161d 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -857,6 +857,9 @@ enum pci_ers_result {
> >>
> >> /* No AER capabilities registered for the driver */
> >> PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
> >> +
> >> + /* Device state requires system panic */
> >> + PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
> >> };
> >>
> >> /* PCI bus error event callbacks */
> >
* Re: [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
2024-10-17 13:31 ` Jonathan Cameron
@ 2024-10-17 14:50 ` Bowman, Terry
0 siblings, 0 replies; 62+ messages in thread
From: Bowman, Terry @ 2024-10-17 14:50 UTC (permalink / raw)
To: Jonathan Cameron, Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/17/2024 8:31 AM, Jonathan Cameron wrote:
> On Wed, 16 Oct 2024 12:31:35 -0500
> Terry Bowman <Terry.Bowman@amd.com> wrote:
>
>> On 10/16/24 11:30, Jonathan Cameron wrote:
>>> On Tue, 8 Oct 2024 17:16:48 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> The CXL AER service will be updated to support CXL PCIe port error
>>>> handling in the future. These devices will use a system panic during
>>>> recovery handling.
>>>
>>> Recovery handling by panic? :) That's an interesting form of recovery..
>>>
>>
>> Yes, Dan requested all UCE (fatal and non-fatal) are handled by panic in order
>> to limit the blast radius of corruption in the case of UCE.
> That's fair enough. Maybe it should be called attempted recovery handling ;)
>
> This is fine.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> Jonathan
>
I'll add "attempted" recovery to the commit message.
Regards,
Terry
>>
>> The recovery logic in cxl_do_recovery() (not using the panic) is also tested as well.
>>
>> Regards,
>> Terry
>>
>>>>
>>>> Add PCI_ERS_RESULT_PANIC enumeration to pci_ers_result type.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> ---
>>>> include/linux/pci.h | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index 4cf89a4b4cbc..6f7e7371161d 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -857,6 +857,9 @@ enum pci_ers_result {
>>>>
>>>> /* No AER capabilities registered for the driver */
>>>> PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>>>> +
>>>> + /* Device state requires system panic */
>>>> + PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
>>>> };
>>>>
>>>> /* PCI bus error event callbacks */
>>>
>
* [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (5 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 16:54 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 08/15] cxl/pci: Change find_cxl_ports() to be non-static Terry Bowman
` (11 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The current pcie_do_recovery() handles device recovery as result of
uncorrectable errors (UCE). But, CXL port devices require unique
recovery handling.
Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
specific handling to the new recovery function.
The CXL port UCE recovery must invoke the AER service driver's CXL port
UCE callback. This is different than the standard pcie_do_recovery()
recovery that calls the pci_driver::err_handler UCE handler instead.
Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
"recover" the error. A panic is called instead of attempting recovery
to avoid potential system corruption.
The uncorrectable support added here will be used to complete CXL PCIe
port error handling in the future.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pci.h | 5 ++
drivers/pci/pcie/aer.c | 5 +-
drivers/pci/pcie/err.c | 150 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 159 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 79c8398f3938..d1f5b42fa48d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -632,6 +632,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_channel_state_t state,
pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
+/* CXL error reporting and recovery */
+pci_ers_result_t cxl_do_recovery(struct pci_dev *dev,
+ pci_channel_state_t state,
+ pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
+
bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9b2872c8e20d..81a19028c4e7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1060,7 +1060,10 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
if (cxl_port_hndlrs && cxl_port_hndlrs->cor_error_detected)
cxl_port_hndlrs->cor_error_detected(dev);
pcie_clear_device_status(dev);
- }
+ } else if (info->severity == AER_NONFATAL)
+ cxl_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
+ else if (info->severity == AER_FATAL)
+ cxl_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
}
static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..de12f2eb19ef 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
return 0;
}
+static int cxl_report_error_detected(struct pci_dev *dev,
+ pci_channel_state_t state,
+ enum pci_ers_result *result)
+{
+ struct cxl_port_err_hndlrs *cxl_port_hndlrs;
+ struct pci_driver *pdrv;
+ pci_ers_result_t vote;
+
+ device_lock(&dev->dev);
+ cxl_port_hndlrs = find_cxl_port_hndlrs();
+ pdrv = dev->driver;
+ if (pci_dev_is_disconnected(dev)) {
+ vote = PCI_ERS_RESULT_DISCONNECT;
+ } else if (!pci_dev_set_io_state(dev, state)) {
+ pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
+ dev->error_state, state);
+ vote = PCI_ERS_RESULT_NONE;
+ } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
+ if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
+ vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+ pci_info(dev, "can't recover (no error_detected callback)\n");
+ } else {
+ vote = PCI_ERS_RESULT_NONE;
+ }
+ } else {
+ vote = cxl_port_hndlrs->error_detected(dev, state);
+ }
+ pci_uevent_ers(dev, vote);
+ *result = merge_result(*result, vote);
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+static int cxl_report_frozen_detected(struct pci_dev *dev, void *data)
+{
+ /*
+ * CXL endpoints report using pci_dev::err_handlers.
+ * CXL PCIe ports report using aer_rpc::cxl_port_err_handlers.
+ */
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT)
+ return report_error_detected(dev, pci_channel_io_frozen, data);
+ else
+ return cxl_report_error_detected(dev, pci_channel_io_frozen, data);
+}
+
+static int cxl_report_normal_detected(struct pci_dev *dev, void *data)
+{
+ /*
+ * CXL endpoints report using pci_dev::err_handlers.
+ * CXL PCIe ports report using aer_rpc::cxl_port_err_handlers.
+ */
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT)
+ return report_error_detected(dev, pci_channel_io_normal, data);
+ else
+ return cxl_report_error_detected(dev, pci_channel_io_normal, data);
+}
+
static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
{
pm_runtime_get_sync(&pdev->dev);
@@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
cb(bridge, userdata);
}
+/**
+ * cxl_walk_bridge - walk bridges potentially AER affected
+ * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
+ * @cb: callback to be called for each device found
+ * @userdata: arbitrary pointer to be passed to callback
+ *
+ * If the device provided is a bridge, walk the subordinate bus, including
+ * the device itself and any bridged devices on buses under this bus. Call
+ * the provided callback on each device found.
+ *
+ * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
+ * call the callback on the device itself.
+ */
+static void cxl_walk_bridge(struct pci_dev *bridge,
+ int (*cb)(struct pci_dev *, void *),
+ void *userdata)
+{
+ cb(bridge, userdata);
+ if (bridge->subordinate)
+ pci_walk_bus(bridge->subordinate, cb, userdata);
+}
+
pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_channel_state_t state,
pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
@@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
}
+
+pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
+ pci_channel_state_t state,
+ pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
+{
+ struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
+ pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+ int type = pci_pcie_type(bridge);
+
+ if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
+ (type != PCI_EXP_TYPE_RC_EC) &&
+ (type != PCI_EXP_TYPE_DOWNSTREAM) &&
+ (type != PCI_EXP_TYPE_UPSTREAM)) {
+ pci_dbg(bridge, "Unsupported device type (%x)\n", type);
+ return status;
+ }
+
+ cxl_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
+
+ pci_dbg(bridge, "broadcast error_detected message\n");
+ if (state == pci_channel_io_frozen) {
+ cxl_walk_bridge(bridge, cxl_report_frozen_detected, &status);
+ if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
+ pci_warn(bridge, "subordinate device reset failed\n");
+ goto failed;
+ }
+ } else {
+ cxl_walk_bridge(bridge, cxl_report_normal_detected, &status);
+ }
+
+ if (status == PCI_ERS_RESULT_PANIC)
+ panic("CXL cachemem error. Invoking panic");
+
+ if (status == PCI_ERS_RESULT_CAN_RECOVER) {
+ status = PCI_ERS_RESULT_RECOVERED;
+ pci_dbg(bridge, "broadcast mmio_enabled message\n");
+ cxl_walk_bridge(bridge, report_mmio_enabled, &status);
+ }
+
+ if (status == PCI_ERS_RESULT_NEED_RESET) {
+ status = PCI_ERS_RESULT_RECOVERED;
+ pci_dbg(bridge, "broadcast slot_reset message\n");
+ report_slot_reset(bridge, &status);
+ pci_walk_bridge(bridge, report_slot_reset, &status);
+ }
+
+ if (status != PCI_ERS_RESULT_RECOVERED)
+ goto failed;
+
+ pci_dbg(bridge, "broadcast resume message\n");
+ cxl_walk_bridge(bridge, report_resume, &status);
+
+ if (host->native_aer || pcie_ports_native) {
+ pcie_clear_device_status(bridge);
+ pci_aer_clear_nonfatal_status(bridge);
+ }
+
+ cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+
+ pci_info(bridge, "device recovery successful\n");
+ return status;
+
+failed:
+ cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+
+ pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
+
+ pci_info(bridge, "device recovery failed\n");
+
+ return status;
+}
--
2.34.1
* Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-08 22:16 ` [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
@ 2024-10-16 16:54 ` Jonathan Cameron
2024-10-16 18:07 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 16:54 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:49 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The current pcie_do_recovery() handles device recovery as result of
> uncorrectable errors (UCE). But, CXL port devices require unique
> recovery handling.
>
> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
> specific handling to the new recovery function.
>
> The CXL port UCE recovery must invoke the AER service driver's CXL port
> UCE callback. This is different than the standard pcie_do_recovery()
> recovery that calls the pci_driver::err_handler UCE handler instead.
>
> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
> "recover" the error. A panic is called instead of attempting recovery
> to avoid potential system corruption.
>
> The uncorrectable support added here will be used to complete CXL PCIe
> port error handling in the future.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,
I'm a little bothered by the subtle difference between the bus walks
in here and the existing cases. If we need them, comments are needed
to explain why.
If we are going to have separate handling, see if you can share
a lot more of the code by factoring out common functions for
the pci and cxl handling with callbacks to handle the differences.
I've managed to get my head around this code a few times in the past
(I think!) and really don't fancy having two subtle variants to
consider next time we get a bug :( The RC_EC additions hurt my head.
Jonathan
> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 31090770fffc..de12f2eb19ef 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
> return 0;
> }
>
> +static int cxl_report_error_detected(struct pci_dev *dev,
> + pci_channel_state_t state,
> + enum pci_ers_result *result)
> +{
> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
> + struct pci_driver *pdrv;
> + pci_ers_result_t vote;
> +
> + device_lock(&dev->dev);
> + cxl_port_hndlrs = find_cxl_port_hndlrs();
Can we refactor to have a common function under this and report_error_detected()?
> + pdrv = dev->driver;
> + if (pci_dev_is_disconnected(dev)) {
> + vote = PCI_ERS_RESULT_DISCONNECT;
> + } else if (!pci_dev_set_io_state(dev, state)) {
> + pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
> + dev->error_state, state);
> + vote = PCI_ERS_RESULT_NONE;
> + } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> + pci_info(dev, "can't recover (no error_detected callback)\n");
> + } else {
> + vote = PCI_ERS_RESULT_NONE;
> + }
> + } else {
> + vote = cxl_port_hndlrs->error_detected(dev, state);
> + }
> + pci_uevent_ers(dev, vote);
> + *result = merge_result(*result, vote);
> + device_unlock(&dev->dev);
> + return 0;
> +}
> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
> {
> pm_runtime_get_sync(&pdev->dev);
> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
> cb(bridge, userdata);
> }
>
> +/**
> + * cxl_walk_bridge - walk bridges potentially AER affected
> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
> + * @cb: callback to be called for each device found
> + * @userdata: arbitrary pointer to be passed to callback
> + *
> + * If the device provided is a bridge, walk the subordinate bus, including
> + * the device itself and any bridged devices on buses under this bus. Call
> + * the provided callback on each device found.
> + *
> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
> + * call the callback on the device itself.
only call the callback on the device itself.
(as you call it as stated above either way).
> + */
> +static void cxl_walk_bridge(struct pci_dev *bridge,
> + int (*cb)(struct pci_dev *, void *),
> + void *userdata)
> +{
> + cb(bridge, userdata);
> + if (bridge->subordinate)
> + pci_walk_bus(bridge->subordinate, cb, userdata);
The difference between this and pci_walk_bridge() is subtle and
I'd like to avoid having both if we can.
> +}
> +
> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_channel_state_t state,
> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>
> return status;
> }
> +
> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
> + pci_channel_state_t state,
> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> +{
> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> + int type = pci_pcie_type(bridge);
> +
> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
> + (type != PCI_EXP_TYPE_RC_EC) &&
> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
> + (type != PCI_EXP_TYPE_UPSTREAM)) {
> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
> + return status;
> + }
> +
Would a trick similar to the one in pcie_do_recovery() work here: for the
upstream and downstream ports use pci_upstream_bridge(), and for the others
pass the dev into pci_walk_bridge()?
> + cxl_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
> +
> + pci_dbg(bridge, "broadcast error_detected message\n");
> + if (state == pci_channel_io_frozen) {
> + cxl_walk_bridge(bridge, cxl_report_frozen_detected, &status);
> + if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
> + pci_warn(bridge, "subordinate device reset failed\n");
> + goto failed;
> + }
> + } else {
> + cxl_walk_bridge(bridge, cxl_report_normal_detected, &status);
> + }
> +
> + if (status == PCI_ERS_RESULT_PANIC)
> + panic("CXL cachemem error. Invoking panic");
> +
> + if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> + status = PCI_ERS_RESULT_RECOVERED;
> + pci_dbg(bridge, "broadcast mmio_enabled message\n");
> + cxl_walk_bridge(bridge, report_mmio_enabled, &status);
> + }
> +
> + if (status == PCI_ERS_RESULT_NEED_RESET) {
> + status = PCI_ERS_RESULT_RECOVERED;
> + pci_dbg(bridge, "broadcast slot_reset message\n");
> + report_slot_reset(bridge, &status);
> + pci_walk_bridge(bridge, report_slot_reset, &status);
> + }
> +
> + if (status != PCI_ERS_RESULT_RECOVERED)
> + goto failed;
> +
> + pci_dbg(bridge, "broadcast resume message\n");
> + cxl_walk_bridge(bridge, report_resume, &status);
> +
> + if (host->native_aer || pcie_ports_native) {
> + pcie_clear_device_status(bridge);
> + pci_aer_clear_nonfatal_status(bridge);
> + }
> +
> + cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
> +
> + pci_info(bridge, "device recovery successful\n");
> + return status;
> +
> +failed:
> + cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
> +
> + pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
> +
> + pci_info(bridge, "device recovery failed\n");
> +
> + return status;
> +}
* Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-16 16:54 ` Jonathan Cameron
@ 2024-10-16 18:07 ` Terry Bowman
2024-10-17 13:43 ` Jonathan Cameron
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-16 18:07 UTC (permalink / raw)
To: Jonathan Cameron
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/16/24 11:54, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:49 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The current pcie_do_recovery() handles device recovery as result of
>> uncorrectable errors (UCE). But, CXL port devices require unique
>> recovery handling.
>>
>> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
>> specific handling to the new recovery function.
>>
>> The CXL port UCE recovery must invoke the AER service driver's CXL port
>> UCE callback. This is different than the standard pcie_do_recovery()
>> recovery that calls the pci_driver::err_handler UCE handler instead.
>>
>> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
>> "recover" the error. A panic is called instead of attempting recovery
>> to avoid potential system corruption.
>>
>> The uncorrectable support added here will be used to complete CXL PCIe
>> port error handling in the future.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> Hi Terry,
>
> I'm a little bothered by the subtle difference in the bus walks
> in here vs the existing cases. If we need them, comments needed
> to explain why.
>
Yes, I will add more details to the commit message about "why".
I added an explanation after your comment below.
> If we are going to have separate handling, see if you can share
> a lot more of the code by factoring out common functions for
> the pci and cxl handling with callbacks to handle the differences.
>
Dan requested separate paths for the PCIe and CXL recovery. The intent,
as I understand it, is to isolate the handling of PCIe and CXL protocol
errors. This creates 2 different classes of protocol errors.
> I've managed to get my head around this code a few times in the past
> (I think!) and really don't fancy having two subtle variants to
> consider next time we get a bug :( The RC_EC additions hurt my head.
>
> Jonathan
Right, the UCE recovery logic is not straightforward. The code can be
refactored to take advantage of reuse. I'm interested in your thoughts
after I have provided some responses here.
>
>> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 31090770fffc..de12f2eb19ef 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
>> return 0;
>> }
>>
>> +static int cxl_report_error_detected(struct pci_dev *dev,
>> + pci_channel_state_t state,
>> + enum pci_ers_result *result)
>> +{
>> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
>> + struct pci_driver *pdrv;
>> + pci_ers_result_t vote;
>> +
>> + device_lock(&dev->dev);
>> + cxl_port_hndlrs = find_cxl_port_hndlrs();
>
> Can we refactor to have a common function under this and report_error_detected()?
>
Sure, this can be refactored.
The difference between cxl_report_error_detected() and report_error_detected() is the
handlers that are called.
cxl_report_error_detected() calls the CXL driver's registered port error handler.
report_error_detected() calls the pci_driver::err_handler.
Let me know if I should refactor for common code here.
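For illustration only, one possible shape of such a refactor is sketched
below as a userspace model. Everything in it is a hypothetical stand-in
(struct model_dev, the MODEL_* values, common_report(), the lookup
helpers); none of these names come from the patch or the kernel. The idea
is that the shared state checks live in one function, and the only
per-path difference is how the error_detected handler is looked up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; not the kernel's pci_ers_result values. */
enum model_result {
	MODEL_NONE,
	MODEL_CAN_RECOVER,
	MODEL_DISCONNECT,
	MODEL_PANIC,
};

struct model_dev;
typedef enum model_result (*detect_fn)(struct model_dev *dev);
typedef detect_fn (*lookup_fn)(struct model_dev *dev);

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	int disconnected;
	detect_fn drv_error_detected;	/* models pci_driver::err_handler */
};

/* Models the registered CXL port handler table (a single slot here). */
static detect_fn cxl_port_handler;

/* PCIe path: the handler comes from the driver bound to the device. */
static detect_fn pci_lookup(struct model_dev *dev)
{
	return dev->drv_error_detected;
}

/* CXL port path: the handler comes from the registered table. */
static detect_fn cxl_lookup(struct model_dev *dev)
{
	(void)dev;
	return cxl_port_handler;
}

/* Common body: shared checks first, then the path-specific handler. */
static enum model_result common_report(struct model_dev *dev, lookup_fn lookup)
{
	detect_fn fn;

	assert(dev && lookup);
	if (dev->disconnected)
		return MODEL_DISCONNECT;
	fn = lookup(dev);
	if (!fn)
		return MODEL_NONE;
	return fn(dev);
}

/* Example handlers used to exercise the two paths. */
static enum model_result always_panic(struct model_dev *dev)
{
	(void)dev;
	return MODEL_PANIC;
}

static enum model_result can_recover(struct model_dev *dev)
{
	(void)dev;
	return MODEL_CAN_RECOVER;
}
```

The sketch leaves out locking, the io-state transition, the uevent and
merge_result(); those are identical in both existing functions and would
presumably stay in the common body.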
>> + pdrv = dev->driver;
>> + if (pci_dev_is_disconnected(dev)) {
>> + vote = PCI_ERS_RESULT_DISCONNECT;
>> + } else if (!pci_dev_set_io_state(dev, state)) {
>> + pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
>> + dev->error_state, state);
>> + vote = PCI_ERS_RESULT_NONE;
>> + } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
>> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>> + pci_info(dev, "can't recover (no error_detected callback)\n");
>> + } else {
>> + vote = PCI_ERS_RESULT_NONE;
>> + }
>> + } else {
>> + vote = cxl_port_hndlrs->error_detected(dev, state);
>> + }
>> + pci_uevent_ers(dev, vote);
>> + *result = merge_result(*result, vote);
>> + device_unlock(&dev->dev);
>> + return 0;
>> +}
>
>> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
>> {
>> pm_runtime_get_sync(&pdev->dev);
>> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>> cb(bridge, userdata);
>> }
>>
>> +/**
>> + * cxl_walk_bridge - walk bridges potentially AER affected
>> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
>> + * @cb: callback to be called for each device found
>> + * @userdata: arbitrary pointer to be passed to callback
>> + *
>> + * If the device provided is a bridge, walk the subordinate bus, including
>> + * the device itself and any bridged devices on buses under this bus. Call
>> + * the provided callback on each device found.
>> + *
>> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
>> + * call the callback on the device itself.
> only call the callback on the device itself.
>
> (as you call it as stated above either way).
>
Thanks. I will update the function header to include "only".
>> + */
>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>> + int (*cb)(struct pci_dev *, void *),
>> + void *userdata)
>> +{
>> + cb(bridge, userdata);
>> + if (bridge->subordinate)
>> + pci_walk_bus(bridge->subordinate, cb, userdata);
> The difference between this and pci_walk_bridge() is subtle and
> I'd like to avoid having both if we can.
>
The cxl_walk_bridge() was added because pci_walk_bridge() does not report
CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
does not call report_error_detected() for the root port itself. If the bridge
is a CXL root port then the CXL port error handler is not called. This has 2
problems: 1. Error logging is not provided, 2. A result vote is not provided
by the root port's CXL port handler.
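
To make the difference concrete, here is a minimal userspace sketch, not kernel code. The toy_* names and the toy_dev structure are invented for illustration. toy_pci_walk_bridge() models the existing pci_walk_bridge() (callback on the bridge only when it has no subordinate bus) and toy_cxl_walk_bridge() models the function added in this patch. Unlike pci_walk_bus(), the toy walk ignores the callback's return value.

```c
#include <stddef.h>

/* Toy stand-in for struct pci_dev; "children" models bridge->subordinate. */
struct toy_dev {
	const char *name;
	struct toy_dev *children;	/* devices on the subordinate bus, or NULL */
	size_t nr_children;
	int visited;			/* set by the example callback below */
};

typedef int (*toy_cb)(struct toy_dev *dev, void *userdata);

/* Rough model of pci_walk_bus(): visit every device below, depth first. */
static void toy_walk_bus(struct toy_dev *devs, size_t n, toy_cb cb, void *userdata)
{
	for (size_t i = 0; i < n; i++) {
		cb(&devs[i], userdata);
		if (devs[i].children)
			toy_walk_bus(devs[i].children, devs[i].nr_children,
				     cb, userdata);
	}
}

/* Models pci_walk_bridge(): the bridge is visited only if nothing is below it. */
static void toy_pci_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *userdata)
{
	if (bridge->children)
		toy_walk_bus(bridge->children, bridge->nr_children, cb, userdata);
	else
		cb(bridge, userdata);
}

/* Models cxl_walk_bridge(): the bridge is always visited, then its subtree. */
static void toy_cxl_walk_bridge(struct toy_dev *bridge, toy_cb cb, void *userdata)
{
	cb(bridge, userdata);
	if (bridge->children)
		toy_walk_bus(bridge->children, bridge->nr_children, cb, userdata);
}

/* Example callback: mark the device and count the visit. */
static int toy_mark_visited(struct toy_dev *dev, void *userdata)
{
	int *count = userdata;

	dev->visited = 1;
	(*count)++;
	return 0;
}
```

With a root port that has two endpoints below it, the first walk runs the callback on the two endpoints only, while the second also runs it on the root port. That extra call is where the root port's CXL handler gets to log and vote.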
>> +}
>> +
>> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_channel_state_t state,
>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>
>> return status;
>> }
>> +
>> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
>> + pci_channel_state_t state,
>> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>> +{
>> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> + int type = pci_pcie_type(bridge);
>> +
>> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
>> + (type != PCI_EXP_TYPE_RC_EC) &&
>> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
>> + (type != PCI_EXP_TYPE_UPSTREAM)) {
>> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
>> + return status;
>> + }
>> +
>
> Would similar trick to in pcie_do_recovery work here for the upstream
> and downstream ports use pci_upstream_bridge() and for the others pass the dev into
> pci_walk_bridge()?
>
Yes, that would be a good starting point to begin reuse refactoring.
I'm interested in getting your and others' feedback on the separation of the
PCI and CXL protocol errors and how much separation is or is not needed.
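
For reference, a rough userspace model of the bridge selection that pcie_do_recovery() performs before walking, as it appears in the upstream source at the time of writing. The toy_* names are invented for illustration; the numeric values mirror the PCI_EXP_TYPE_* definitions in include/uapi/linux/pci_regs.h. Whether cxl_do_recovery() can reuse exactly this selection is the open question above, so treat this as a sketch only.

```c
#include <stddef.h>

/* Values mirror PCI_EXP_TYPE_* in include/uapi/linux/pci_regs.h. */
enum toy_type {
	TOY_TYPE_ENDPOINT   = 0x0,
	TOY_TYPE_ROOT_PORT  = 0x4,
	TOY_TYPE_UPSTREAM   = 0x5,
	TOY_TYPE_DOWNSTREAM = 0x6,
	TOY_TYPE_RC_END     = 0x9,
	TOY_TYPE_RC_EC      = 0xa,
};

struct toy_dev {
	enum toy_type type;
	struct toy_dev *upstream;	/* models pci_upstream_bridge(dev) */
};

/*
 * Pick the device the walk starts from: root ports, downstream ports,
 * RCECs and RCiEPs are used as-is; anything else (for example an
 * upstream port or an endpoint) defers to the bridge above it.
 */
static struct toy_dev *toy_pick_bridge(struct toy_dev *dev)
{
	switch (dev->type) {
	case TOY_TYPE_ROOT_PORT:
	case TOY_TYPE_DOWNSTREAM:
	case TOY_TYPE_RC_EC:
	case TOY_TYPE_RC_END:
		return dev;
	default:
		return dev->upstream;
	}
}
```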
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-16 18:07 ` Terry Bowman
@ 2024-10-17 13:43 ` Jonathan Cameron
2024-10-17 16:21 ` Bowman, Terry
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 13:43 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Wed, 16 Oct 2024 13:07:37 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:
> Hi Jonathan,
>
> On 10/16/24 11:54, Jonathan Cameron wrote:
> > On Tue, 8 Oct 2024 17:16:49 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> The current pcie_do_recovery() handles device recovery as result of
> >> uncorrectable errors (UCE). But, CXL port devices require unique
> >> recovery handling.
> >>
> >> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
> >> specific handling to the new recovery function.
> >>
> >> The CXL port UCE recovery must invoke the AER service driver's CXL port
> >> UCE callback. This is different than the standard pcie_do_recovery()
> >> recovery that calls the pci_driver::err_handler UCE handler instead.
> >>
> >> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
> >> "recover" the error. A panic is called instead of attempting recovery
> >> to avoid potential system corruption.
> >>
> >> The uncorrectable support added here will be used to complete CXL PCIe
> >> port error handling in the future.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >
> > Hi Terry,
> >
> > I'm a little bothered by the subtle difference in the bus walks
> > in here vs the existing cases. If we need them, comments needed
> > to explain why.
> >
>
> Yes, I will add more details in the commit message about "why".
> I added explanation following your below comment.
>
> > If we are going to have separate handling, see if you can share
> > a lot more of the code by factoring out common functions for
> > the pci and cxl handling with callbacks to handle the differences.
> >
>
> Dan requested separate paths for the PCIe and CXL recovery. The intent,
> as I understand, is to isolate the handling of PCIe and CXL protocol
> errors. This is to create 2 different classes of protocol errors.
Function call chain wise I'm reasonably convinced that might be a good
idea. But not code wise if it means we end up with more hard to review
code.
>
> > I've managed to get my head around this code a few times in the past
> > (I think!) and really don't fancy having two subtle variants to
> > consider next time we get a bug :( The RC_EC additions hurt my head.
> >
> > Jonathan
>
> Right, the UCE recovery logic is not straightforward. The code can be
> refactored to take advantage of reuse. I'm interested in your thoughts
> after I have provided some responses here.
>
> >
> >> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 31090770fffc..de12f2eb19ef 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
> >> return 0;
> >> }
> >>
> >> +static int cxl_report_error_detected(struct pci_dev *dev,
> >> + pci_channel_state_t state,
> >> + enum pci_ers_result *result)
> >> +{
> >> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
> >> + struct pci_driver *pdrv;
> >> + pci_ers_result_t vote;
> >> +
> >> + device_lock(&dev->dev);
> >> + cxl_port_hndlrs = find_cxl_port_hndlrs();
> >
> > Can we refactor to have a common function under this and report_error_detected()?
> >
>
> Sure, this can be refactored.
>
> The difference between cxl_report_error_detected() and report_error_detected() is the
> handlers that are called.
>
> cxl_report_error_detected() calls the CXL driver's registered port error handler.
>
> report_error_detected() calls the pci_driver::err_handler callbacks.
>
> Let me know if I should refactor for common code here?
It certainly makes sense to do that somewhere in here. Just have light
wrappers that provide callbacks so the bulk of the code is shared.
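
As a rough illustration of that shape (a userspace sketch, not a proposed kernel patch): one shared core does the common work and takes a callback that produces the vote, and the PCIe and CXL entry points are thin wrappers. All toy_* names are invented, and the "keep the most severe vote" merge is a deliberate simplification that stands in for merge_result().

```c
#include <stddef.h>

/* Simplified votes, ordered by severity (not the kernel's pci_ers_result_t). */
enum toy_vote {
	TOY_VOTE_NONE,
	TOY_VOTE_CAN_RECOVER,
	TOY_VOTE_NEED_RESET,
	TOY_VOTE_DISCONNECT,
};

struct toy_dev {
	int disconnected;
	/* Stand-in for the bound driver's error_detected handler; may be NULL. */
	enum toy_vote (*drv_error_detected)(struct toy_dev *dev);
};

/* Stand-in for the CXL port handler registered with the AER driver. */
static enum toy_vote (*toy_cxl_error_detected)(struct toy_dev *dev);

typedef enum toy_vote (*toy_vote_fn)(struct toy_dev *dev);

/* Shared core: everything that does not depend on which handler is used. */
static int toy_report_common(struct toy_dev *dev, toy_vote_fn get_vote,
			     enum toy_vote *result)
{
	enum toy_vote vote;

	if (dev->disconnected)
		vote = TOY_VOTE_DISCONNECT;
	else
		vote = get_vote(dev);

	/* Simplistic merge: keep the most severe vote seen so far. */
	if (vote > *result)
		*result = vote;
	return 0;
}

static enum toy_vote toy_pci_vote(struct toy_dev *dev)
{
	return dev->drv_error_detected ? dev->drv_error_detected(dev)
				       : TOY_VOTE_NONE;
}

static enum toy_vote toy_cxl_vote(struct toy_dev *dev)
{
	return toy_cxl_error_detected ? toy_cxl_error_detected(dev)
				      : TOY_VOTE_NONE;
}

/* Light wrappers: the only difference is the callback handed to the core. */
static int toy_report_error_detected(struct toy_dev *dev, enum toy_vote *result)
{
	return toy_report_common(dev, toy_pci_vote, result);
}

static int toy_cxl_report_error_detected(struct toy_dev *dev, enum toy_vote *result)
{
	return toy_report_common(dev, toy_cxl_vote, result);
}

/* Example handlers used only to exercise the wrappers. */
static enum toy_vote toy_drv_says_recover(struct toy_dev *dev)
{
	(void)dev;
	return TOY_VOTE_CAN_RECOVER;
}

static enum toy_vote toy_cxl_says_reset(struct toy_dev *dev)
{
	(void)dev;
	return TOY_VOTE_NEED_RESET;
}
```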
>
>
> >> + pdrv = dev->driver;
> >> + if (pci_dev_is_disconnected(dev)) {
> >> + vote = PCI_ERS_RESULT_DISCONNECT;
> >> + } else if (!pci_dev_set_io_state(dev, state)) {
> >> + pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
> >> + dev->error_state, state);
> >> + vote = PCI_ERS_RESULT_NONE;
> >> + } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
> >> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> >> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> >> + pci_info(dev, "can't recover (no error_detected callback)\n");
> >> + } else {
> >> + vote = PCI_ERS_RESULT_NONE;
> >> + }
> >> + } else {
> >> + vote = cxl_port_hndlrs->error_detected(dev, state);
> >> + }
> >> + pci_uevent_ers(dev, vote);
> >> + *result = merge_result(*result, vote);
> >> + device_unlock(&dev->dev);
> >> + return 0;
> >> +}
> >
> >> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
> >> {
> >> pm_runtime_get_sync(&pdev->dev);
> >> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
> >> cb(bridge, userdata);
> >> }
> >>
> >> +/**
> >> + * cxl_walk_bridge - walk bridges potentially AER affected
> >> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
> >> + * @cb: callback to be called for each device found
> >> + * @userdata: arbitrary pointer to be passed to callback
> >> + *
> >> + * If the device provided is a bridge, walk the subordinate bus, including
> >> + * the device itself and any bridged devices on buses under this bus. Call
> >> + * the provided callback on each device found.
> >> + *
> >> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
> >> + * call the callback on the device itself.
> > only call the callback on the device itself.
> >
> > (as you call it as stated above either way).
> >
>
> Thanks. I will update the function header to include "only".
>
> >> + */
> >> +static void cxl_walk_bridge(struct pci_dev *bridge,
> >> + int (*cb)(struct pci_dev *, void *),
> >> + void *userdata)
> >> +{
> >> + cb(bridge, userdata);
> >> + if (bridge->subordinate)
> >> + pci_walk_bus(bridge->subordinate, cb, userdata);
> > The difference between this and pci_walk_bridge() is subtle and
> > I'd like to avoid having both if we can.
> >
>
> The cxl_walk_bridge() was added because pci_walk_bridge() does not report
> CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
> does not call report_error_detected() for the root port itself. If the bridge
> is a CXL root port then the CXL port error handler is not called. This has 2
> problems: 1. Error logging is not provided, 2. A result vote is not provided
> by the root port's CXL port handler.
So what happens for PCIe errors on the root port? How are they reported?
What I'm failing to understand is why these should be different.
Maybe there is something missing on the PCIe side though!
That code plays a game with which device it treats as the bridge, and I thought
that was there to handle this case.
>
> >> +}
> >> +
> >> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >> pci_channel_state_t state,
> >> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> >> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>
> >> return status;
> >> }
> >> +
> >> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
> >> + pci_channel_state_t state,
> >> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> >> +{
> >> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
> >> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >> + int type = pci_pcie_type(bridge);
> >> +
> >> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
> >> + (type != PCI_EXP_TYPE_RC_EC) &&
> >> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
> >> + (type != PCI_EXP_TYPE_UPSTREAM)) {
> >> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
> >> + return status;
> >> + }
> >> +
> >
> > Would similar trick to in pcie_do_recovery work here for the upstream
> > and downstream ports use pci_upstream_bridge() and for the others pass the dev into
> > pci_walk_bridge()?
> >
>
> Yes, that would be a good starting point to begin reuse refactoring.
> I'm interested in getting your and others' feedback on the separation of the
> PCI and CXL protocol errors and how much separation is or is not needed.
Separation may make sense (I'm still thinking about it) for separate passes
through the topology and separate callbacks / handling when an error is seen.
What I don't want to see is two horribly complex separate walking codes if
we can possibly avoid it. Long term to me that just means two sets of bugs
and problem corners instead of one.
Jonathan
>
>
> Regards,
> Terry
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-17 13:43 ` Jonathan Cameron
@ 2024-10-17 16:21 ` Bowman, Terry
2024-10-17 17:08 ` Jonathan Cameron
0 siblings, 1 reply; 62+ messages in thread
From: Bowman, Terry @ 2024-10-17 16:21 UTC (permalink / raw)
To: Jonathan Cameron, Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/17/2024 8:43 AM, Jonathan Cameron wrote:
> On Wed, 16 Oct 2024 13:07:37 -0500
> Terry Bowman <Terry.Bowman@amd.com> wrote:
>
>> Hi Jonathan,
>>
>> On 10/16/24 11:54, Jonathan Cameron wrote:
>>> On Tue, 8 Oct 2024 17:16:49 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> The current pcie_do_recovery() handles device recovery as result of
>>>> uncorrectable errors (UCE). But, CXL port devices require unique
>>>> recovery handling.
>>>>
>>>> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
>>>> specific handling to the new recovery function.
>>>>
>>>> The CXL port UCE recovery must invoke the AER service driver's CXL port
>>>> UCE callback. This is different than the standard pcie_do_recovery()
>>>> recovery that calls the pci_driver::err_handler UCE handler instead.
>>>>
>>>> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
>>>> "recover" the error. A panic is called instead of attempting recovery
>>>> to avoid potential system corruption.
>>>>
>>>> The uncorrectable support added here will be used to complete CXL PCIe
>>>> port error handling in the future.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>
>>> Hi Terry,
>>>
>>> I'm a little bothered by the subtle difference in the bus walks
>>> in here vs the existing cases. If we need them, comments needed
>>> to explain why.
>>>
>>
>> Yes, I will add more details in the commit message about "why".
>> I added explanation following your below comment.
>>
>>> If we are going to have separate handling, see if you can share
>>> a lot more of the code by factoring out common functions for
>>> the pci and cxl handling with callbacks to handle the differences.
>>>
>>
>> Dan requested separate paths for the PCIe and CXL recovery. The intent,
>> as I understand, is to isolate the handling of PCIe and CXL protocol
>> errors. This is to create 2 different classes of protocol errors.
> Function call chain wise I'm reasonably convinced that might be a good
> idea. But not code wise if it means we end up with more hard to review
> code.
>
>>
>>> I've managed to get my head around this code a few times in the past
>>> (I think!) and really don't fancy having two subtle variants to
>>> consider next time we get a bug :( The RC_EC additions hurt my head.
>>>
>>> Jonathan
>>
>> Right, the UCE recovery logic is not straightforward. The code can be
>> refactored to take advantage of reuse. I'm interested in your thoughts
>> after I have provided some responses here.
>>
>>>
>>>> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 31090770fffc..de12f2eb19ef 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
>>>> return 0;
>>>> }
>>>>
>>>> +static int cxl_report_error_detected(struct pci_dev *dev,
>>>> + pci_channel_state_t state,
>>>> + enum pci_ers_result *result)
>>>> +{
>>>> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
>>>> + struct pci_driver *pdrv;
>>>> + pci_ers_result_t vote;
>>>> +
>>>> + device_lock(&dev->dev);
>>>> + cxl_port_hndlrs = find_cxl_port_hndlrs();
>>>
>>> Can we refactor to have a common function under this and report_error_detected()?
>>>
>>
>> Sure, this can be refactored.
>>
>> The difference between cxl_report_error_detected() and report_error_detected() is the
>> handlers that are called.
>>
>> cxl_report_error_detected() calls the CXL driver's registered port error handler.
>>
>> report_error_detected() calls the pci_driver::err_handler callbacks.
>>
>> Let me know if I should refactor for common code here?
>
> It certainly makes sense to do that somewhere in here. Just have light
> wrappers that provide callbacks so the bulk of the code is shared.
>
OK, I'll start on that. I have a v2 ready to go without the reuse changes.
Do you want me to wait on sending v2 until it has the reuse refactoring?
>>
>>
>>>> + pdrv = dev->driver;
>>>> + if (pci_dev_is_disconnected(dev)) {
>>>> + vote = PCI_ERS_RESULT_DISCONNECT;
>>>> + } else if (!pci_dev_set_io_state(dev, state)) {
>>>> + pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
>>>> + dev->error_state, state);
>>>> + vote = PCI_ERS_RESULT_NONE;
>>>> + } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
>>>> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>>>> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>>>> + pci_info(dev, "can't recover (no error_detected callback)\n");
>>>> + } else {
>>>> + vote = PCI_ERS_RESULT_NONE;
>>>> + }
>>>> + } else {
>>>> + vote = cxl_port_hndlrs->error_detected(dev, state);
>>>> + }
>>>> + pci_uevent_ers(dev, vote);
>>>> + *result = merge_result(*result, vote);
>>>> + device_unlock(&dev->dev);
>>>> + return 0;
>>>> +}
>>>
>>>> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
>>>> {
>>>> pm_runtime_get_sync(&pdev->dev);
>>>> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>>>> cb(bridge, userdata);
>>>> }
>>>>
>>>> +/**
>>>> + * cxl_walk_bridge - walk bridges potentially AER affected
>>>> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
>>>> + * @cb: callback to be called for each device found
>>>> + * @userdata: arbitrary pointer to be passed to callback
>>>> + *
>>>> + * If the device provided is a bridge, walk the subordinate bus, including
>>>> + * the device itself and any bridged devices on buses under this bus. Call
>>>> + * the provided callback on each device found.
>>>> + *
>>>> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
>>>> + * call the callback on the device itself.
>>> only call the callback on the device itself.
>>>
>>> (as you call it as stated above either way).
>>>
>>
>> Thanks. I will update the function header to include "only".
>>
>>>> + */
>>>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>>>> + int (*cb)(struct pci_dev *, void *),
>>>> + void *userdata)
>>>> +{
>>>> + cb(bridge, userdata);
>>>> + if (bridge->subordinate)
>>>> + pci_walk_bus(bridge->subordinate, cb, userdata);
>>> The difference between this and pci_walk_bridge() is subtle and
>>> I'd like to avoid having both if we can.
>>>
>>
>> The cxl_walk_bridge() was added because pci_walk_bridge() does not report
>> CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
>> does not call report_error_detected() for the root port itself. If the bridge
>> is a CXL root port then the CXL port error handler is not called. This has 2
>> problems: 1. Error logging is not provided, 2. A result vote is not provided
>> by the root port's CXL port handler.
>
> So what happens for PCIe errors on the root port? How are they reported?
> What I'm failing to understand is why these should be different.
> Maybe there is something missing on the PCIe side though!
> That code plays a game with which device it treats as the bridge, and I thought
> that was there to handle this case.
>
PCIe errors (not CXL errors) on a root port will be processed as they are today.
An AER error is treated as a CXL error if *all* of the following are met:
- The AER error is an internal error
  - Check is in AER's is_internal_error(info) function.
- The device is a CXL device
  - Check is in AER's handles_cxl_errors() function.
Root port PCIe error processing will not call the root port's
pci_driver::err_handler::error_detected() because of the pci_walk_bridge()
implementation. The result vote that directs handling is determined by the
downstream devices. This has probably been OK until now because ports have been
fairly vanilla and standard until CXL.
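
A minimal sketch of that routing decision, assuming the intended rule is "internal error AND CXL device" (the series description says CXL protocol errors are reported through PCIe internal errors). The toy_* types are invented for illustration; in the kernel the two checks are is_internal_error(info) and handles_cxl_errors().

```c
#include <stdbool.h>

/* Toy stand-ins for the AER error info and the device. */
struct toy_aer_info {
	bool internal_error;	/* models is_internal_error(info) */
};

struct toy_dev {
	bool is_cxl;		/* models handles_cxl_errors(dev) */
};

enum toy_path {
	TOY_PATH_PCIE,		/* existing PCIe handling, unchanged */
	TOY_PATH_CXL,		/* CXL protocol error handling */
};

/* Route to CXL handling only when both conditions hold. */
static enum toy_path toy_pick_path(const struct toy_dev *dev,
				   const struct toy_aer_info *info)
{
	if (info->internal_error && dev->is_cxl)
		return TOY_PATH_CXL;
	return TOY_PATH_PCIE;
}
```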
>>
>>>> +}
>>>> +
>>>> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>> pci_channel_state_t state,
>>>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>
>>>> return status;
>>>> }
>>>> +
>>>> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
>>>> + pci_channel_state_t state,
>>>> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> +{
>>>> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
>>>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>> + int type = pci_pcie_type(bridge);
>>>> +
>>>> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> + (type != PCI_EXP_TYPE_RC_EC) &&
>>>> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
>>>> + (type != PCI_EXP_TYPE_UPSTREAM)) {
>>>> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
>>>> + return status;
>>>> + }
>>>> +
>>>
>>> Would similar trick to in pcie_do_recovery work here for the upstream
>>> and downstream ports use pci_upstream_bridge() and for the others pass the dev into
>>> pci_walk_bridge()?
>>>
>>
>> Yes, that would be a good starting point to begin reuse refactoring.
>> I'm interested in getting your and others' feedback on the separation of the
>> PCI and CXL protocol errors and how much separation is or is not needed.
>
> Separation may make sense (I'm still thinking about it) for separate passes
> through the topology and separate callbacks / handling when an error is seen.
> What I don't want to see is two horribly complex separate walking codes if
> we can possibly avoid it. Long term to me that just means two sets of bugs
> and problem corners instead of one.
>
> Jonathan
>
I understand. I will look to make changes here for reuse.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
2024-10-17 16:21 ` Bowman, Terry
@ 2024-10-17 17:08 ` Jonathan Cameron
0 siblings, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 17:08 UTC (permalink / raw)
To: Bowman, Terry
Cc: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
On Thu, 17 Oct 2024 11:21:36 -0500
"Bowman, Terry" <kibowman@amd.com> wrote:
> Hi Jonathan,
>
> On 10/17/2024 8:43 AM, Jonathan Cameron wrote:
> > On Wed, 16 Oct 2024 13:07:37 -0500
> > Terry Bowman <Terry.Bowman@amd.com> wrote:
> >
> >> Hi Jonathan,
> >>
> >> On 10/16/24 11:54, Jonathan Cameron wrote:
> >>> On Tue, 8 Oct 2024 17:16:49 -0500
> >>> Terry Bowman <terry.bowman@amd.com> wrote:
> >>>
> >>>> The current pcie_do_recovery() handles device recovery as result of
> >>>> uncorrectable errors (UCE). But, CXL port devices require unique
> >>>> recovery handling.
> >>>>
> >>>> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
> >>>> specific handling to the new recovery function.
> >>>>
> >>>> The CXL port UCE recovery must invoke the AER service driver's CXL port
> >>>> UCE callback. This is different than the standard pcie_do_recovery()
> >>>> recovery that calls the pci_driver::err_handler UCE handler instead.
> >>>>
> >>>> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
> >>>> "recover" the error. A panic is called instead of attempting recovery
> >>>> to avoid potential system corruption.
> >>>>
> >>>> The uncorrectable support added here will be used to complete CXL PCIe
> >>>> port error handling in the future.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >>>
> >>> Hi Terry,
> >>>
> >>> I'm a little bothered by the subtle difference in the bus walks
> >>> in here vs the existing cases. If we need them, comments needed
> >>> to explain why.
> >>>
> >>
> >> Yes, I will add more details in the commit message about "why".
> >> I added explanation following your below comment.
> >>
> >>> If we are going to have separate handling, see if you can share
> >>> a lot more of the code by factoring out common functions for
> >>> the pci and cxl handling with callbacks to handle the differences.
> >>>
> >>
> >> Dan requested separate paths for the PCIe and CXL recovery. The intent,
> >> as I understand, is to isolate the handling of PCIe and CXL protocol
> >> errors. This is to create 2 different classes of protocol errors.
> > Function call chain wise I'm reasonably convinced that might be a good
> > idea. But not code wise if it means we end up with more hard to review
> > code.
> >
> >>
> >>> I've managed to get my head around this code a few times in the past
> >>> (I think!) and really don't fancy having two subtle variants to
> >>> consider next time we get a bug :( The RC_EC additions hurt my head.
> >>>
> >>> Jonathan
> >>
> >> Right, the UCE recovery logic is not straightforward. The code can be
> >> refactored to take advantage of reuse. I'm interested in your thoughts
> >> after I have provided some responses here.
> >>
> >>>
> >>>> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> >>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >>>> index 31090770fffc..de12f2eb19ef 100644
> >>>> --- a/drivers/pci/pcie/err.c
> >>>> +++ b/drivers/pci/pcie/err.c
> >>>> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
> >>>> return 0;
> >>>> }
> >>>>
> >>>> +static int cxl_report_error_detected(struct pci_dev *dev,
> >>>> + pci_channel_state_t state,
> >>>> + enum pci_ers_result *result)
> >>>> +{
> >>>> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
> >>>> + struct pci_driver *pdrv;
> >>>> + pci_ers_result_t vote;
> >>>> +
> >>>> + device_lock(&dev->dev);
> >>>> + cxl_port_hndlrs = find_cxl_port_hndlrs();
> >>>
> >>> Can we refactor to have a common function under this and report_error_detected()?
> >>>
> >>
> >> Sure, this can be refactored.
> >>
> >> The difference between cxl_report_error_detected() and report_error_detected() is the
> >> handlers that are called.
> >>
> >> cxl_report_error_detected() calls the CXL driver's registered port error handler.
> >>
> >> report_error_detected() calls the pci_driver::err_handler callbacks.
> >>
> >> Let me know if I should refactor for common code here?
> >
> > It certainly makes sense to do that somewhere in here. Just have light
> > wrappers that provide callbacks so the bulk of the code is shared.
> >
>
> OK, I'll start on that. I have a v2 ready to go without the reuse changes.
> Do you want me to wait on sending v2 until it has the reuse refactoring?
I'd imagine we might have some time after v2, so go ahead - experiments
with refactoring can come later.
> >>>> + */
> >>>> +static void cxl_walk_bridge(struct pci_dev *bridge,
> >>>> + int (*cb)(struct pci_dev *, void *),
> >>>> + void *userdata)
> >>>> +{
> >>>> + cb(bridge, userdata);
> >>>> + if (bridge->subordinate)
> >>>> + pci_walk_bus(bridge->subordinate, cb, userdata);
> >>> The difference between this and pci_walk_bridge() is subtle and
> >>> I'd like to avoid having both if we can.
> >>>
> >>
> >> The cxl_walk_bridge() was added because pci_walk_bridge() does not report
> >> CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
> >> does not call report_error_detected() for the root port itself. If the bridge
> >> is a CXL root port then the CXL port error handler is not called. This has 2
> >> problems: 1. Error logging is not provided, 2. A result vote is not provided
> >> by the root port's CXL port handler.
> >
> > So what happens for PCIe errors on the root port? How are they reported?
> > What I'm failing to understand is why these should be different.
> > Maybe there is something missing on the PCIe side though!
> > That code plays a game with which device it treats as the bridge, and I thought
> > that was there to handle this case.
> >
>
> PCIe errors (not CXL errors) on a root port will be processed as they are today.
Sure, I was just failing to understand why the code didn't need to check
for error_detected on the root port, but the CXL code does.
>
> An AER error is treated as a CXL error if *all* of the following are met:
> - The AER error is an internal error
>   - Check is in AER's is_internal_error(info) function.
> - The device is a CXL device
>   - Check is in AER's handles_cxl_errors() function.
>
> Root port PCIe error processing will not call the root port's
> pci_driver::err_handler::error_detected() because of the pci_walk_bridge()
> implementation. The result vote that directs handling is determined by the
> downstream devices. This has probably been OK until now because ports have been
> fairly vanilla and standard until CXL.
Ah. Got it - Root ports didn't have the handler.
So is there any harm in making them run it? (I was going to say no, as they
don't have one, but actually they do: there is one in portdrv.)
That way the two codes look more similar. Also, does this mean there were
runtime pm and other calls that didn't hit the root port for PCIe that should have
done?
It comes back to not wanting overly complex bits of code. I'm fine with changing
the PCIe one to add new handling needed for CXL.
Just to be clear I'm fine with totally separate call paths just with lots
of code reuse. That may mean a few precursor patches touching only the
PCIe code to make it fit for reuse.
Jonathan
>
>
> >>
> >>>> +}
> >>>> +
> >>>> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>>> pci_channel_state_t state,
> >>>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> >>>> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>>>
> >>>> return status;
> >>>> }
> >>>> +
> >>>> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
> >>>> + pci_channel_state_t state,
> >>>> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> >>>> +{
> >>>> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
> >>>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>>> + int type = pci_pcie_type(bridge);
> >>>> +
> >>>> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
> >>>> + (type != PCI_EXP_TYPE_RC_EC) &&
> >>>> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
> >>>> + (type != PCI_EXP_TYPE_UPSTREAM)) {
> >>>> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
> >>>> + return status;
> >>>> + }
> >>>> +
> >>>
> >>> Would similar trick to in pcie_do_recovery work here for the upstream
> >>> and downstream ports use pci_upstream_bridge() and for the others pass the dev into
> >>> pci_walk_bridge()?
> >>>
> >>
> >> Yes, that would be a good starting point to begin reuse refactoring.
> >> I'm interested in getting your and others' feedback on the separation of the
> >> PCI and CXL protocol errors and how much separation is or is not needed.
> >
> > Separation may make sense (I'm still thinking about it) for separate passes
> > through the topology and separate callbacks / handling when an error is seen.
> > What I don't want to see is two horribly complex separate walking codes if
> > we can possibly avoid it. Long term to me that just means two sets of bugs
> > and problem corners instead of one.
> >
> > Jonathan
> >
>
> I understand. I will look to make changes here for reuse.
>
> Regards,
> Terry
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 08/15] cxl/pci: Change find_cxl_port() to be non-static
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (6 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-08 22:16 ` [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers Terry Bowman
` (10 subsequent siblings)
18 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
CXL PCIe port protocol error support will be added in the future. This
requires searching for a CXL PCIe port device in the CXL topology as
provided by find_cxl_port(). However, find_cxl_port() is declared static
and is therefore not callable outside of this source file.
Update the find_cxl_port() declaration to be non-static.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/core.h | 3 +++
drivers/cxl/core/port.c | 4 ++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 72a506c9dbd0..14a8b4d14af6 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -108,4 +108,7 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
enum access_coordinate_class access);
bool cxl_need_node_perf_attrs_update(int nid);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport);
+
#endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..089a1f4535c1 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
return NULL;
}
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
- struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
--
2.34.1
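
A minimal sketch of the linkage change described above, for readers less familiar with it. Everything named demo_* is hypothetical and only stands in for the kernel's find_cxl_port(), struct cxl_port and struct cxl_dport. The sketch shows a prototype as it would sit in a shared header, a matching non-static definition, and a lookup that returns the owning port while handing the matching dport back through an out-parameter. Setting the out-parameter to NULL on a miss is an assumption of this sketch, not something the patch states about the kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cxl_dport / struct cxl_port. */
struct demo_dport {
	int dev_id;                     /* stands in for dport->dport_dev */
};

struct demo_port {
	struct demo_dport dports[2];
	size_t nr_dports;
};

/*
 * Prototype as it would appear in a shared header. Without "static" the
 * symbol has external linkage, so other translation units that include
 * the header can call it.
 */
struct demo_port *demo_find_port(struct demo_port *ports, size_t nr_ports,
				 int dev_id, struct demo_dport **dport);

/* Definition: no "static", matching the prototype above. */
struct demo_port *demo_find_port(struct demo_port *ports, size_t nr_ports,
				 int dev_id, struct demo_dport **dport)
{
	assert(ports != NULL);

	/* Sketch assumption: the out-parameter always gets a defined value. */
	if (dport)
		*dport = NULL;

	for (size_t i = 0; i < nr_ports; i++) {
		for (size_t j = 0; j < ports[i].nr_dports; j++) {
			if (ports[i].dports[j].dev_id == dev_id) {
				if (dport)
					*dport = &ports[i].dports[j];
				return &ports[i];
			}
		}
	}

	return NULL;                    /* no port owns this device */
}
```

A caller in another file would include the header carrying the prototype and check the returned port for NULL before touching either the port or the dport.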
* [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (7 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 08/15] cxl/pci: Change find_cxl_port() to be non-static Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 17:14 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 10/15] cxl/pci: Map CXL PCIe upstream " Terry Bowman
` (9 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
RAS registers are not mapped for CXL root ports, CXL downstream switch
ports, or CXL upstream switch ports. To prepare for future RAS logging
and handling, the driver needs updating to map PCIe port RAS registers.
Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
Update the function such that it will iterate an endpoint's dports to map
the RAS registers.
Rename cxl_dport_map_regs() to be cxl_dport_init_aer(). The new
function name is a more accurate description of the function's work.
This update should also include checking for previously mapped registers
within the topology, particularly with CXL switches. Endpoints under a
CXL switch may share a common downstream and upstream port. Ensure that
the registers are only mapped once.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 37 ++++++++++++++++---------------------
drivers/cxl/cxl.h | 7 ++++---
drivers/cxl/mem.c | 27 +++++++++++++++++++++++++--
3 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 51132a575b27..6f7bcdb389bf 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -787,21 +787,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
dport->regs.dport_aer = dport_aer;
}
-static void cxl_dport_map_regs(struct cxl_dport *dport)
-{
- struct cxl_register_map *map = &dport->reg_map;
- struct device *dev = dport->dport_dev;
-
- if (!map->component_map.ras.valid)
- dev_dbg(dev, "RAS registers not found\n");
- else if (cxl_map_component_regs(map, &dport->regs.component,
- BIT(CXL_CM_CAP_CAP_ID_RAS)))
- dev_dbg(dev, "Failed to map RAS capability.\n");
-
- if (dport->rch)
- cxl_dport_map_rch_aer(dport);
-}
-
static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
{
void __iomem *aer_base = dport->regs.dport_aer;
@@ -831,7 +816,7 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
}
}
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
+void cxl_dport_init_aer(struct cxl_dport *dport)
{
struct device *dport_dev = dport->dport_dev;
@@ -840,15 +825,25 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
if (host_bridge->native_aer)
dport->rcrb.aer_cap = cxl_rcrb_to_aer(dport_dev, dport->rcrb.base);
+
+ cxl_dport_map_rch_aer(dport);
+ cxl_disable_rch_root_ints(dport);
}
- dport->reg_map.host = host;
- cxl_dport_map_regs(dport);
+ /* dport may have more than 1 downstream EP. Check if already mapped. */
+ if (dport->regs.ras) {
+ dev_warn(dport_dev, "RAS is already mapped\n");
+ return;
+ }
- if (dport->rch)
- cxl_disable_rch_root_ints(dport);
+ dport->reg_map.host = dport_dev;
+ if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+ dev_err(dport_dev, "Failed to map RAS capability.\n");
+ return;
+ }
}
-EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
+EXPORT_SYMBOL_NS_GPL(cxl_dport_init_aer, CXL);
static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
struct cxl_dport *dport)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 9afb407d438f..cb9e05e2912b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -592,6 +592,7 @@ struct cxl_dax_region {
* @parent_dport: dport that points to this port in the parent
* @decoder_ida: allocator for decoder ids
* @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
* @nr_dports: number of entries in @dports
* @hdm_end: track last allocated HDM decoder instance for allocation ordering
* @commit_end: cursor to track highest committed decoder for commit ordering
@@ -612,6 +613,7 @@ struct cxl_port {
struct cxl_dport *parent_dport;
struct ida decoder_ida;
struct cxl_register_map reg_map;
+ struct cxl_component_regs uport_regs;
int nr_dports;
int hdm_end;
int commit_end;
@@ -761,10 +763,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
resource_size_t rcrb);
#ifdef CONFIG_PCIEAER_CXL
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
+void cxl_dport_init_aer(struct cxl_dport *dport);
#else
-static inline void cxl_setup_parent_dport(struct device *host,
- struct cxl_dport *dport) { }
+static inline void cxl_dport_init_aer(struct cxl_dport *dport) { }
#endif
struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 7de232eaeb17..b7204f010785 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,6 +45,30 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
return 0;
}
+static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
+{
+ struct pci_dev *pdev;
+
+ if (!dev_is_pci(dev))
+ return false;
+
+ pdev = to_pci_dev(dev);
+ if (pci_pcie_type(pdev) != pcie_type)
+ return false;
+
+ return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ CXL_DVSEC_REG_LOCATOR);
+}
+
+static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
+{
+ struct cxl_dport *dport = ep->dport;
+
+ if (dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
+ dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
+ cxl_dport_init_aer(dport);
+}
+
static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
struct cxl_dport *parent_dport)
{
@@ -62,6 +86,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
ep = cxl_ep_load(iter, cxlmd);
ep->next = down;
+ cxl_init_ep_ports_aer(ep);
}
/* Note: endpoint port component registers are derived from @cxlds */
@@ -166,8 +191,6 @@ static int cxl_mem_probe(struct device *dev)
else
endpoint_parent = &parent_port->dev;
- cxl_setup_parent_dport(dev, dport);
-
device_lock(endpoint_parent);
if (!endpoint_parent->driver) {
dev_err(dev, "CXL port topology %s not enabled\n",
--
2.34.1
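
The "map only once" guard described in the commit message can be sketched in isolation as follows. This is an illustration under stated assumptions, not the driver code: the demo_* names are hypothetical, the mapping step is simulated with a static buffer and always succeeds, and whether the already-mapped path should log anything is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dport shared by several endpoints. */
struct demo_dport {
	void *ras;        /* NULL until mapped; stands in for dport->regs.ras */
	int map_calls;    /* counts how often the simulated mapping really ran */
};

static char demo_ras_block[64];     /* pretend register window */

/* Simulated mapping step; always succeeds in this sketch. */
static int demo_map_ras(struct demo_dport *dport)
{
	dport->ras = demo_ras_block;
	dport->map_calls++;
	return 0;
}

/* Returns 0 on success, including the case where RAS is already mapped. */
int demo_dport_init_ras(struct demo_dport *dport)
{
	assert(dport != NULL);

	/* A second endpoint under the same dport lands here and does nothing. */
	if (dport->ras)
		return 0;

	return demo_map_ras(dport);
}
```

Calling demo_dport_init_ras() once per endpoint is then safe regardless of how many endpoints sit below the same port.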
* Re: [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers
2024-10-08 22:16 ` [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers Terry Bowman
@ 2024-10-16 17:14 ` Jonathan Cameron
2024-10-16 18:16 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 17:14 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:51 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> RAS registers are not mapped for CXL root ports, CXL downstream switch
> ports, or CXL upstream switch ports. To prepare for future RAS logging
> and handling, the driver needs updating to map PCIe port RAS registers.
Given the upstream port is in the next patch, I'd just mention that you
are adding mapping of RP and DSP here (this confused me before I noticed
the next patch).
>
> Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
> Update the function such that it will iterate an endpoint's dports to map
> the RAS registers.
>
> Rename cxl_dport_map_regs() to be cxl_dport_init_aer(). The new
> function name is a more accurate description of the function's work.
>
> This update should also include checking for previously mapped registers
> within the topology, particularly with CXL switches. Endpoints under a
> CXL switch may share a common downstream and upstream port, ensure that
> the registers are only mapped once.
I don't understand why we need to do this for the RAS registers when
it doesn't apply to, for instance, the HDM decoders. Why can't
we map these registers in cxl_port_probe()?
End of day here, so maybe I'm completely misunderstanding this.
Will take another look tomorrow morning.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/cxl/core/pci.c | 37 ++++++++++++++++---------------------
> drivers/cxl/cxl.h | 7 ++++---
> drivers/cxl/mem.c | 27 +++++++++++++++++++++++++--
> 3 files changed, 45 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 51132a575b27..6f7bcdb389bf 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -787,21 +787,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> dport->regs.dport_aer = dport_aer;
> }
>
> -static void cxl_dport_map_regs(struct cxl_dport *dport)
> -{
> - struct cxl_register_map *map = &dport->reg_map;
> - struct device *dev = dport->dport_dev;
> -
> - if (!map->component_map.ras.valid)
> - dev_dbg(dev, "RAS registers not found\n");
> - else if (cxl_map_component_regs(map, &dport->regs.component,
> - BIT(CXL_CM_CAP_CAP_ID_RAS)))
> - dev_dbg(dev, "Failed to map RAS capability.\n");
> -
> - if (dport->rch)
> - cxl_dport_map_rch_aer(dport);
> -}
> -
> static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> {
> void __iomem *aer_base = dport->regs.dport_aer;
> @@ -831,7 +816,7 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> }
> }
>
> -void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
> +void cxl_dport_init_aer(struct cxl_dport *dport)
> {
> struct device *dport_dev = dport->dport_dev;
>
> @@ -840,15 +825,25 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>
> if (host_bridge->native_aer)
> dport->rcrb.aer_cap = cxl_rcrb_to_aer(dport_dev, dport->rcrb.base);
> +
> + cxl_dport_map_rch_aer(dport);
> + cxl_disable_rch_root_ints(dport);
> }
>
> - dport->reg_map.host = host;
> - cxl_dport_map_regs(dport);
> + /* dport may have more than 1 downstream EP. Check if already mapped. */
> + if (dport->regs.ras) {
> + dev_warn(dport_dev, "RAS is already mapped\n");
This is valid. Why are we warning?
However, why do we need this dance here but not for other
root port registers, etc.?
> + return;
> + }
>
> - if (dport->rch)
> - cxl_disable_rch_root_ints(dport);
> + dport->reg_map.host = dport_dev;
> + if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
> + BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> + dev_err(dport_dev, "Failed to map RAS capability.\n");
> + return;
> + }
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
> +EXPORT_SYMBOL_NS_GPL(cxl_dport_init_aer, CXL);
>
> static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
> struct cxl_dport *dport)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9afb407d438f..cb9e05e2912b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -592,6 +592,7 @@ struct cxl_dax_region {
> * @parent_dport: dport that points to this port in the parent
> * @decoder_ida: allocator for decoder ids
> * @reg_map: component and ras register mapping parameters
> + * @uport_regs: mapped component registers
> * @nr_dports: number of entries in @dports
> * @hdm_end: track last allocated HDM decoder instance for allocation ordering
> * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -612,6 +613,7 @@ struct cxl_port {
> struct cxl_dport *parent_dport;
> struct ida decoder_ida;
> struct cxl_register_map reg_map;
> + struct cxl_component_regs uport_regs;
> int nr_dports;
> int hdm_end;
> int commit_end;
> @@ -761,10 +763,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
> resource_size_t rcrb);
>
> #ifdef CONFIG_PCIEAER_CXL
> -void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
> +void cxl_dport_init_aer(struct cxl_dport *dport);
> #else
> -static inline void cxl_setup_parent_dport(struct device *host,
> - struct cxl_dport *dport) { }
> +static inline void cxl_dport_init_aer(struct cxl_dport *dport) { }
> #endif
>
> struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 7de232eaeb17..b7204f010785 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,6 +45,30 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
> return 0;
> }
>
> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
> +{
> + struct pci_dev *pdev;
> +
> + if (!dev_is_pci(dev))
> + return false;
> +
> + pdev = to_pci_dev(dev);
> + if (pci_pcie_type(pdev) != pcie_type)
> + return false;
> +
> + return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + CXL_DVSEC_REG_LOCATOR);
> +}
> +
> +static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
> +{
> + struct cxl_dport *dport = ep->dport;
> +
> + if (dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> + dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> + cxl_dport_init_aer(dport);
> +}
> +
> static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> struct cxl_dport *parent_dport)
> {
> @@ -62,6 +86,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>
> ep = cxl_ep_load(iter, cxlmd);
> ep->next = down;
> + cxl_init_ep_ports_aer(ep);
> }
>
> /* Note: endpoint port component registers are derived from @cxlds */
> @@ -166,8 +191,6 @@ static int cxl_mem_probe(struct device *dev)
> else
> endpoint_parent = &parent_port->dev;
>
> - cxl_setup_parent_dport(dev, dport);
> -
> device_lock(endpoint_parent);
> if (!endpoint_parent->driver) {
> dev_err(dev, "CXL port topology %s not enabled\n",
* Re: [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers
2024-10-16 17:14 ` Jonathan Cameron
@ 2024-10-16 18:16 ` Terry Bowman
2024-10-17 13:50 ` Jonathan Cameron
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-16 18:16 UTC (permalink / raw)
To: Jonathan Cameron
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/16/24 12:14, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:51 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> RAS registers are not mapped for CXL root ports, CXL downstream switch
>> ports, or CXL upstream switch ports. To prepare for future RAS logging
>> and handling, the driver needs updating to map PCIe port RAS registers.
>
> Given the upstream port is in the next patch, I'd just mention that you
> are adding mapping of RP and DSP here (This confused me before I noticed
> the next patch).
Ok. Good point,
>>
>> Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
>> Update the function such that it will iterate an endpoint's dports to map
>> the RAS registers.
>>
>> Rename cxl_dport_map_regs() to be cxl_dport_init_aer(). The new
>> function name is a more accurate description of the function's work.
>>
>> This update should also include checking for previously mapped registers
>> within the topology, particularly with CXL switches. Endpoints under a
>> CXL switch may share a common downstream and upstream port, ensure that
>> the registers are only mapped once.
>
> I don't understand why we need to do this for the ras registers but
> it doesn't apply for HDM decoders for instance? Why can't
> we map these registers in cxl_port_probe()?
>
We have seen downstream root ports with DVSECs that are not fully populated
immediately after booting. The plan here was to push out the RAS register
block mapping until as late as possible, in the memdev driver.
> End of day here, so maybe I'm completely misunderstanding this.
> Will take another look tomorrow morning.
>
Thanks for your reviews.
Regards,
Terry
* Re: [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers
2024-10-16 18:16 ` Terry Bowman
@ 2024-10-17 13:50 ` Jonathan Cameron
2024-10-17 16:26 ` Bowman, Terry
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 13:50 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Wed, 16 Oct 2024 13:16:34 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:
> Hi Jonathan,
>
> On 10/16/24 12:14, Jonathan Cameron wrote:
> > On Tue, 8 Oct 2024 17:16:51 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> RAS registers are not mapped for CXL root ports, CXL downstream switch
> >> ports, or CXL upstream switch ports. To prepare for future RAS logging
> >> and handling, the driver needs updating to map PCIe port RAS registers.
> >
> > Given the upstream port is in the next patch, I'd just mention that you
> > are adding mapping of RP and DSP here (This confused me before I noticed
> > the next patch).
>
> Ok. Good point,
>
> >>
> >> Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
> >> Update the function such that it will iterate an endpoint's dports to map
> >> the RAS registers.
> >>
> >> Rename cxl_dport_map_regs() to be cxl_dport_init_aer(). The new
> >> function name is a more accurate description of the function's work.
> >>
> >> This update should also include checking for previously mapped registers
> >> within the topology, particularly with CXL switches. Endpoints under a
> >> CXL switch may share a common downstream and upstream port, ensure that
> >> the registers are only mapped once.
> >
> > I don't understand why we need to do this for the ras registers but
> > it doesn't apply for HDM decoders for instance? Why can't
> > we map these registers in cxl_port_probe()?
> >
>
> We have seen downstream root ports with DVSECs that are not fully populated
> immediately after booting. The plan here was to push out the RAS register
> block mapping until as late as possible, in the memdev driver.
That needs debugging because simply pushing it later like this is
only going to make the race harder to hit unless we understand the
'why' of that. If there is a reason to delay, my gut feeling would
be to delay the cxl_port_probe() until things are stable rather
than just trying this a bit later.
This might be the whole "link must train before CXL registers are
presented" thing (a less-than-ideal corner of the CXL spec), but I'm not
sure that would mean they weren't available in cxl_port_probe().
Jonathan
>
>
> > End of day here, so maybe I'm completely misunderstanding this.
> > Will take another look tomorrow morning.
> >
>
> Thanks for your reviews.
>
> Regards,
> Terry
>
* Re: [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers
2024-10-17 13:50 ` Jonathan Cameron
@ 2024-10-17 16:26 ` Bowman, Terry
0 siblings, 0 replies; 62+ messages in thread
From: Bowman, Terry @ 2024-10-17 16:26 UTC (permalink / raw)
To: Jonathan Cameron, Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/17/2024 8:50 AM, Jonathan Cameron wrote:
> On Wed, 16 Oct 2024 13:16:34 -0500
> Terry Bowman <Terry.Bowman@amd.com> wrote:
>
>> Hi Jonathan,
>>
>> On 10/16/24 12:14, Jonathan Cameron wrote:
>>> On Tue, 8 Oct 2024 17:16:51 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> RAS registers are not mapped for CXL root ports, CXL downstream switch
>>>> ports, or CXL upstream switch ports. To prepare for future RAS logging
>>>> and handling, the driver needs updating to map PCIe port RAS registers.
>>>
>>> Given the upstream port is in the next patch, I'd just mention that you
>>> are adding mapping of RP and DSP here (This confused me before I noticed
>>> the next patch).
>>
>> Ok. Good point,
>>
>>>>
>>>> Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
>>>> Update the function such that it will iterate an endpoint's dports to map
>>>> the RAS registers.
>>>>
>>>> Rename cxl_dport_map_regs() to be cxl_dport_init_aer(). The new
>>>> function name is a more accurate description of the function's work.
>>>>
>>>> This update should also include checking for previously mapped registers
>>>> within the topology, particularly with CXL switches. Endpoints under a
>>>> CXL switch may share a common downstream and upstream port, ensure that
>>>> the registers are only mapped once.
>>>
>>> I don't understand why we need to do this for the ras registers but
>>> it doesn't apply for HDM decoders for instance? Why can't
>>> we map these registers in cxl_port_probe()?
>>>
>>
>> We have seen downstream root ports with DVSECs that are not fully populated
>> immediately after booting. The plan here was to push out the RAS register
>> block mapping until as late as possible, in the memdev driver.
>
> That needs debugging because simply pushing it later like this is
> only going to make the race harder to hit unless we understand the
> 'why' of that. If there is a reason to delay, my gut feeling would
> be to delay the cxl_port_probe() until things are stable rather
> than just trying this a bit later.
>
> This might be the whole link must train before CXL registers are
> presented thing (a less than ideal corner of the CXL spec) but not
> sure it would mean they weren't available in cxl_port_probe()
>
> Jonathan
>
>
>
My understanding is there is no spec defined expectation for when CXL
config registers are ready.
We need Dan's feedback. He has asked several times for this to be located after
adding the endpoint in the memdev driver.
Regards,
Terry
* [PATCH 10/15] cxl/pci: Map CXL PCIe upstream port RAS registers
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (8 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-08 22:16 ` [PATCH 11/15] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
` (8 subsequent siblings)
18 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
RAS registers are mapped for CXL root ports and CXL downstream switch
ports, but not for CXL upstream switch ports. CXL upstream switch ports' mapped
RAS registers are required for handling and logging protocol errors.
Introduce 'struct cxl_regs' member into 'struct cxl_port' to store a
pointer to the upstream port's mapped RAS registers.
Map the CXL upstream switch port's RAS register block.
The upstream port may have multiple downstream endpoints. Before
mapping the RAS registers, check whether they are already mapped.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 17 +++++++++++++++++
drivers/cxl/cxl.h | 2 ++
drivers/cxl/mem.c | 3 +++
3 files changed, 22 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 6f7bcdb389bf..be181358a775 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -816,6 +816,23 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
}
}
+void cxl_uport_init_aer(struct cxl_port *port)
+{
+ /* uport may have more than 1 downstream EP. Check if already mapped. */
+ if (port->uport_regs.ras) {
+ dev_warn(&port->dev, "RAS is already mapped\n");
+ return;
+ }
+
+ port->reg_map.host = &port->dev;
+ if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+ dev_err(&port->dev, "Failed to map RAS capability.\n");
+ return;
+ }
+}
+EXPORT_SYMBOL_NS_GPL(cxl_uport_init_aer, CXL);
+
void cxl_dport_init_aer(struct cxl_dport *dport)
{
struct device *dport_dev = dport->dport_dev;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index cb9e05e2912b..7a5f2c33223e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -764,8 +764,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
#ifdef CONFIG_PCIEAER_CXL
void cxl_dport_init_aer(struct cxl_dport *dport);
+void cxl_uport_init_aer(struct cxl_port *port);
#else
static inline void cxl_dport_init_aer(struct cxl_dport *dport) { }
+static inline void cxl_uport_init_aer(struct cxl_port *port) { }
#endif
struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index b7204f010785..82b1383fb6f3 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -67,6 +67,9 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
if (dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
cxl_dport_init_aer(dport);
+
+ if (dev_is_cxl_pci(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
+ cxl_uport_init_aer(dport->port);
}
static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
--
2.34.1
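
As a rough illustration of the per-endpoint dispatch that patches 09 and 10 build up in cxl_init_ep_ports_aer(), the sketch below classifies the parent dport device and the port's uport device by PCIe type and counts which init path would run. All demo_* names are hypothetical, the has_cxl_dvsec flag stands in for the register locator DVSEC lookup, and the counters replace the real mapping calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCI_EXP_TYPE_* values used in the patch. */
enum demo_port_type {
	DEMO_ROOT_PORT,
	DEMO_UPSTREAM,
	DEMO_DOWNSTREAM,
	DEMO_OTHER,
};

struct demo_dev {
	enum demo_port_type type;
	bool has_cxl_dvsec;     /* stands in for the register locator DVSEC check */
};

struct demo_ep {
	struct demo_dev *dport_dev;   /* parent dport device */
	struct demo_dev *uport_dev;   /* uport device of the port owning the dport */
	int dport_inits;              /* times the dport path would have run */
	int uport_inits;              /* times the uport path would have run */
};

/* True only for a device of the requested type that also looks like CXL. */
static bool demo_is_cxl(const struct demo_dev *dev, enum demo_port_type type)
{
	return dev && dev->type == type && dev->has_cxl_dvsec;
}

void demo_init_ep_ports(struct demo_ep *ep)
{
	assert(ep != NULL);

	/* Root ports and downstream switch ports take the dport path. */
	if (demo_is_cxl(ep->dport_dev, DEMO_DOWNSTREAM) ||
	    demo_is_cxl(ep->dport_dev, DEMO_ROOT_PORT))
		ep->dport_inits++;

	/* Only an upstream switch port takes the uport path. */
	if (demo_is_cxl(ep->uport_dev, DEMO_UPSTREAM))
		ep->uport_inits++;
}
```

For an endpoint directly below a root port the uport is not an upstream switch port, so only the dport path runs; below a switch both paths run.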
* [PATCH 11/15] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (9 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 10/15] cxl/pci: Map CXL PCIe upstream " Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-08 22:16 ` [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
` (7 subsequent siblings)
18 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
CXL PCIe port protocol error handling support will be added to the
CXL drivers in the future. In preparation, refactor the existing
interfaces to support handling all CXL PCIe port protocol errors.
The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL port
devices. However, since the same CXL RAS capability structure is
needed across most CXL components and devices, a common handling
approach should be adopted.
To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a `struct device` instead of
`struct cxl_dev_state`.
No functional changes are introduced.
[1] CXL3.1 - 8.2.4 CXL.cache and CXL.mem Registers
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index be181358a775..c3c82c051d73 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -686,7 +686,7 @@ void read_cdat_data(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
void __iomem *ras_base)
{
void __iomem *addr;
@@ -699,13 +699,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+ trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
}
}
static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
{
- return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+ return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
}
/* CXL spec rev3.0 8.2.4.16.1 */
@@ -729,8 +729,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
- void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -757,7 +756,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+ trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -765,7 +764,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
{
- return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+ return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
}
#ifdef CONFIG_PCIEAER_CXL
@@ -865,13 +864,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_aer, CXL);
static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
struct cxl_dport *dport)
{
- return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+ return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
}
static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
struct cxl_dport *dport)
{
- return __cxl_handle_ras(cxlds, dport->regs.ras);
+ return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
}
/*
--
2.34.1
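
The correctable-error path touched by this patch reads the status register, checks it against a mask, writes the set bits back, and logs the status. A stand-alone sketch of that sequence follows. It assumes write-1-to-clear semantics for the status register, which the write-back of the set bits suggests. The demo_* names and the mask value are hypothetical, and the register is simulated with a plain struct field rather than readl()/writel() on an __iomem pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_COR_STATUS_MASK 0x7Fu   /* hypothetical mask of defined status bits */

/* Simulated RAS capability with a single correctable status register. */
struct demo_ras {
	uint32_t cor_status;
};

static uint32_t demo_readl(const struct demo_ras *ras)
{
	assert(ras != NULL);
	return ras->cor_status;
}

/* Simulates write-1-to-clear: every 1 bit written clears that status bit. */
static void demo_writel(struct demo_ras *ras, uint32_t val)
{
	assert(ras != NULL);
	ras->cor_status &= ~val;
}

/*
 * Returns true if a correctable error was found and cleared. The raw
 * status is handed back through @logged in place of a trace event.
 */
bool demo_handle_cor_ras(struct demo_ras *ras, uint32_t *logged)
{
	uint32_t status;

	if (!ras)
		return false;        /* registers were never mapped */

	status = demo_readl(ras);
	if (!(status & DEMO_COR_STATUS_MASK))
		return false;        /* nothing to handle */

	/* Clear only the bits covered by the mask. */
	demo_writel(ras, status & DEMO_COR_STATUS_MASK);
	if (logged)
		*logged = status;

	return true;
}
```

Because the helper takes only the register block and an output slot, it does not care whether the caller is an endpoint or a port, which is the property the interface change in the patch is after.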
* [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (10 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 11/15] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-17 13:57 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 13/15] cxl/pci: Add trace logging " Terry Bowman
` (6 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The CXL drivers do not contain error handlers for CXL PCIe port
device protocol errors. These are needed in order to handle and log
RAS protocol errors.
Add CXL PCIe port protocol error handlers to the CXL driver.
Provide access to RAS registers for the specific CXL PCIe port types:
root port, upstream switch port, and downstream switch port.
Also, register and unregister the CXL PCIe port error handlers with
the AER service driver using register_cxl_port_hndlrs() and
unregister_cxl_port_hndlrs(). Invoke the registration from
cxl_pci_driver_init() and the unregistration from cxl_pci_driver_exit().
[1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
Upstream Switch Ports
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 83 ++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 5 +++
drivers/cxl/pci.c | 8 ++++
3 files changed, 96 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c3c82c051d73..7e3770f7a955 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -815,6 +815,89 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
}
}
+static int match_uport(struct device *dev, const void *data)
+{
+ struct device *uport_dev = (struct device *)data;
+ struct cxl_port *port;
+
+ if (!is_cxl_port(dev))
+ return 0;
+
+ port = to_cxl_port(dev);
+
+ return port->uport_dev == uport_dev;
+}
+
+static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
+{
+ void __iomem *ras_base;
+ struct cxl_port *port;
+
+ if (!pdev)
+ return NULL;
+
+ if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+ (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+ struct cxl_dport *dport;
+
+ port = find_cxl_port(&pdev->dev, &dport);
+ ras_base = dport ? dport->regs.ras : NULL;
+ put_device(&port->dev);
+ return ras_base;
+ } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
+ struct device *port_dev __free(put_device);
+
+ port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport);
+ if (!port_dev)
+ return NULL;
+
+ port = to_cxl_port(port_dev);
+ if (!port)
+ return NULL;
+
+ ras_base = port ? port->uport_regs.ras : NULL;
+ return ras_base;
+ }
+
+ return NULL;
+}
+
+void cxl_cor_port_err_detected(struct pci_dev *pdev)
+{
+ void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+ __cxl_handle_cor_ras(&pdev->dev, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cor_port_err_detected, CXL);
+
+pci_ers_result_t cxl_port_err_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+ void __iomem *ras_base = cxl_pci_port_ras(pdev);
+ bool ue;
+
+ ue = __cxl_handle_ras(&pdev->dev, ras_base);
+ if (ue)
+ return PCI_ERS_RESULT_PANIC;
+
+ switch (state) {
+ case pci_channel_io_normal:
+ dev_err(&pdev->dev, "%s():%d: pci_channel_io_normal\n",
+ __func__, __LINE__);
+ return PCI_ERS_RESULT_CAN_RECOVER;
+ case pci_channel_io_frozen:
+ dev_err(&pdev->dev, "%s():%d: pci_channel_io_frozen\n",
+ __func__, __LINE__);
+ return PCI_ERS_RESULT_NEED_RESET;
+ case pci_channel_io_perm_failure:
+ dev_err(&pdev->dev, "%s():%d: pci_channel_io_perm_failure\n",
+ __func__, __LINE__);
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_err_detected, CXL);
+
void cxl_uport_init_aer(struct cxl_port *port)
{
/* uport may have more than 1 downstream EP. Check if already mapped. */
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 7a5f2c33223e..06fcde4b88b5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -10,6 +10,7 @@
#include <linux/bitops.h>
#include <linux/log2.h>
#include <linux/node.h>
+#include <linux/pci.h>
#include <linux/io.h>
extern const struct nvdimm_security_ops *cxl_security_ops;
@@ -901,6 +902,10 @@ void cxl_coordinates_combine(struct access_coordinate *out,
bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
+pci_ers_result_t cxl_port_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
+void cxl_cor_port_err_detected(struct pci_dev *pdev);
+
/*
* Unit test builds overrides this to __weak, find the 'strong' version
* of these symbols in tools/testing/cxl/.
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 4be35dc22202..9179b34c35bb 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -978,6 +978,11 @@ static void cxl_reset_done(struct pci_dev *pdev)
}
}
+static struct cxl_port_err_hndlrs cxl_port_hndlrs = {
+ .error_detected = cxl_port_err_detected,
+ .cor_error_detected = cxl_cor_port_err_detected
+};
+
static const struct pci_error_handlers cxl_error_handlers = {
.error_detected = cxl_error_detected,
.slot_reset = cxl_slot_reset,
@@ -1054,11 +1059,14 @@ static int __init cxl_pci_driver_init(void)
if (rc)
pci_unregister_driver(&cxl_pci_driver);
+ register_cxl_port_hndlrs(&cxl_port_hndlrs);
+
return rc;
}
static void __exit cxl_pci_driver_exit(void)
{
+ unregister_cxl_port_hndlrs();
cxl_cper_unregister_work(&cxl_cper_work);
cancel_work_sync(&cxl_cper_work);
pci_unregister_driver(&cxl_pci_driver);
--
2.34.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors
2024-10-08 22:16 ` [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
@ 2024-10-17 13:57 ` Jonathan Cameron
2024-10-17 16:42 ` Bowman, Terry
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 13:57 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:54 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL drivers do not contain error handlers for CXL PCIe port
> device protocol errors. These are needed in order to handle and log
> RAS protocol errors.
>
> Add CXL PCIe port protocol error handlers to the CXL driver.
>
> Provide access to RAS registers for the specific CXL PCIe port types:
> root port, upstream switch port, and downstream switch port.
>
> Also, register and unregister the CXL PCIe port error handlers with
> the AER service driver using register_cxl_port_hndlrs() and
> unregister_cxl_port_hndlrs(). Invoke the registration from
> cxl_pci_driver_init() and the unregistration from cxl_pci_driver_exit().
>
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
A few comments inline.
Jonathan
> ---
> drivers/cxl/core/pci.c | 83 ++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 5 +++
> drivers/cxl/pci.c | 8 ++++
> 3 files changed, 96 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index c3c82c051d73..7e3770f7a955 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -815,6 +815,89 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> }
> }
>
> +static int match_uport(struct device *dev, const void *data)
> +{
> + struct device *uport_dev = (struct device *)data;
> + struct cxl_port *port;
> +
> + if (!is_cxl_port(dev))
> + return 0;
> +
> + port = to_cxl_port(dev);
> +
> + return port->uport_dev == uport_dev;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> + void __iomem *ras_base;
> + struct cxl_port *port;
> +
> + if (!pdev)
> + return NULL;
Why would this happen? Seems an odd check to have so maybe a comment.
> +
> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> + struct cxl_dport *dport;
> +
> + port = find_cxl_port(&pdev->dev, &dport);
Can in theory fail.
> + ras_base = dport ? dport->regs.ras : NULL;
> + put_device(&port->dev);
If it fails this is a null pointer dereference.
> + return ras_base;
> + } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> + struct device *port_dev __free(put_device);
Should be combined with the next line. We want it to be hard for anyone
to put code in between!
> +
> + port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport);
> + if (!port_dev)
> + return NULL;
> +
> + port = to_cxl_port(port_dev);
> + if (!port)
> + return NULL;
> +
> + ras_base = port ? port->uport_regs.ras : NULL;
Given check above, port exists. Remove one of the two
checks.
> + return ras_base;
> + }
> +
> + return NULL;
> +}
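The two comments on the root/downstream port branch come down to checking
the lookup result before dropping the reference. A minimal userspace sketch
of that pattern follows. It is not the kernel code: the mock_* types and
helpers are hypothetical stand-ins for struct cxl_port, struct cxl_dport,
find_cxl_port() and put_device(), and the refcount field exists only so the
behaviour can be observed.

```c
#include <stddef.h>

/* Mock stand-ins for the kernel structures; names are illustrative only. */
struct mock_dport { void *ras; };
struct mock_port  { int refcount; struct mock_dport dport; };

/* Hypothetical lookup: returns NULL on failure, else takes a reference. */
static struct mock_port *mock_find_port(struct mock_port *candidate,
					struct mock_dport **dport)
{
	*dport = NULL;
	if (!candidate)
		return NULL;
	candidate->refcount++;
	*dport = &candidate->dport;
	return candidate;
}

static void mock_put_port(struct mock_port *port)
{
	port->refcount--;
}

/* Returns the RAS base, or NULL if the port lookup failed. */
static void *mock_dport_ras(struct mock_port *candidate)
{
	struct mock_dport *dport;
	struct mock_port *port = mock_find_port(candidate, &dport);
	void *ras_base;

	if (!port)		/* lookup can fail: bail out before the put */
		return NULL;

	ras_base = dport ? dport->ras : NULL;
	mock_put_port(port);	/* drop the reference only if one was taken */
	return ras_base;
}
```

With this ordering a failed lookup returns NULL without touching the port,
and a successful lookup leaves the reference count where it started.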
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors
2024-10-17 13:57 ` Jonathan Cameron
@ 2024-10-17 16:42 ` Bowman, Terry
0 siblings, 0 replies; 62+ messages in thread
From: Bowman, Terry @ 2024-10-17 16:42 UTC (permalink / raw)
To: Jonathan Cameron, Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/17/2024 8:57 AM, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:54 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL drivers do not contain error handlers for CXL PCIe port
>> device protocol errors. These are needed in order to handle and log
>> RAS protocol errors.
>>
>> Add CXL PCIe port protocol error handlers to the CXL driver.
>>
>> Provide access to RAS registers for the specific CXL PCIe port types:
>> root port, upstream switch port, and downstream switch port.
>>
>> Also, register and unregister the CXL PCIe port error handlers with
>> the AER service driver using register_cxl_port_hndlrs() and
>> unregister_cxl_port_hndlrs(). Invoke the registration from
>> cxl_pci_driver_init() and the unregistration from cxl_pci_driver_exit().
>>
>> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> A few comments inline.
>
> Jonathan
>
>> ---
>> drivers/cxl/core/pci.c | 83 ++++++++++++++++++++++++++++++++++++++++++
>> drivers/cxl/cxl.h | 5 +++
>> drivers/cxl/pci.c | 8 ++++
>> 3 files changed, 96 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index c3c82c051d73..7e3770f7a955 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -815,6 +815,89 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>> }
>> }
>>
>> +static int match_uport(struct device *dev, const void *data)
>> +{
>> + struct device *uport_dev = (struct device *)data;
>> + struct cxl_port *port;
>> +
>> + if (!is_cxl_port(dev))
>> + return 0;
>> +
>> + port = to_cxl_port(dev);
>> +
>> + return port->uport_dev == uport_dev;
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> + void __iomem *ras_base;
>> + struct cxl_port *port;
>> +
>> + if (!pdev)
>> + return NULL;
> Why would this happen? Seems an odd check to have so maybe a comment.
>
This is called directly from cxl_port_err_detected() and cxl_cor_port_err_detected().
We moved the pdev validation check into cxl_pci_port_ras().
>> +
>> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> + struct cxl_dport *dport;
>> +
>> + port = find_cxl_port(&pdev->dev, &dport);
> Can in theory fail.
>> + ras_base = dport ? dport->regs.ras : NULL;
>> + put_device(&port->dev);
> If it fails this is a null pointer dereference.
>
>> + return ras_base;
>> + } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>> + struct device *port_dev __free(put_device);
>
> Should be combined with the next line. We want it to be hard for anyone
> to put code in between!
>
>> +
>> + port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport);
>> + if (!port_dev)
>> + return NULL;
>> +
>> + port = to_cxl_port(port_dev);
>> + if (!port)
>> + return NULL;
>> +
>> + ras_base = port ? port->uport_regs.ras : NULL;
>
> Given check above, port exists. Remove one of the two
> checks.
>
>> + return ras_base;
>> + }
>> +
>> + return NULL;
>> +}
For v2 (not posted yet) I have changed this to declare 'port' as follows at the top of the
function, covering both the if and else blocks that use 'port'. It is not needed for dport.
struct cxl_port *port __free(put_cxl_port) = NULL;
Regards,
Terry
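For readers unfamiliar with the __free() annotation discussed above: the
kernel's scope-based cleanup helpers are built on the compiler's cleanup
variable attribute. The userspace sketch below illustrates the idea and why
the review asks for declaration and initialisation in a single statement.
Everything prefixed mock_ is hypothetical and is not the kernel's
put_cxl_port class or the planned v2 code.

```c
#include <stddef.h>

struct mock_port { int refcount; void *uport_ras; };

static void mock_put_port(struct mock_port *port)
{
	if (port)
		port->refcount--;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void mock_free_put_port(struct mock_port **portp)
{
	mock_put_port(*portp);
}

/* Hypothetical userspace analogue of a __free(...) annotation. */
#define mock_free_port __attribute__((cleanup(mock_free_put_port)))

/* Hypothetical lookup: takes a reference when the port exists. */
static struct mock_port *mock_get_port(struct mock_port *candidate)
{
	if (candidate)
		candidate->refcount++;
	return candidate;
}

static void *mock_uport_ras(struct mock_port *candidate)
{
	/*
	 * Declare and initialise in one statement so that no code path can
	 * run the cleanup handler on an uninitialised pointer.
	 */
	struct mock_port *port mock_free_port = mock_get_port(candidate);

	if (!port)
		return NULL;

	return port->uport_ras;	/* reference is dropped on scope exit */
}
```

The handler runs on every exit from the scope, so the NULL case has to be
tolerated by the put helper, which is why mock_put_port() checks its
argument.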
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 13/15] cxl/pci: Add trace logging for CXL PCIe port RAS errors
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (11 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-17 14:04 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors() Terry Bowman
` (5 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The CXL drivers use kernel trace functions for logging endpoint and
RCH downstream port RAS errors. However, similar functionality is
required for CXL root ports, CXL downstream switch ports, and CXL
upstream switch ports.
Introduce trace logging functions for both RAS correctable and
uncorrectable errors specific to CXL PCIe ports. Additionally, update
the PCIe port error handlers to invoke these new trace functions.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 16 ++++++++++----
drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 7e3770f7a955..4706113d2582 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -697,10 +697,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
- if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
- writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+ return;
+ writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+ if (is_cxl_memdev(dev))
trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
- }
+ else if (dev_is_pci(dev))
+ trace_cxl_port_aer_correctable_error(dev, status);
}
static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -756,7 +760,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ if (is_cxl_memdev(dev))
+ trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ else if (dev_is_pci(dev))
+ trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
+
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 9167cfba7f59..6305c0eea627 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,34 @@
{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" } \
)
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+ TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
+ TP_ARGS(dev, status, fe, hl),
+ TP_STRUCT__entry(
+ __string(devname, dev_name(dev))
+ __string(host, dev_name(dev->parent))
+ __field(u32, status)
+ __field(u32, first_error)
+ __array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+ ),
+ TP_fast_assign(
+ __assign_str(devname);
+ __assign_str(host);
+ __entry->status = status;
+ __entry->first_error = fe;
+ /*
+ * Embed the 512B headerlog data for user app retrieval and
+ * parsing, but no need to print this in the trace buffer.
+ */
+ memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
+ ),
+ TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
+ __get_str(devname), __get_str(host),
+ show_uc_errs(__entry->status),
+ show_uc_errs(__entry->first_error)
+ )
+);
+
TRACE_EVENT(cxl_aer_uncorrectable_error,
TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" } \
)
+TRACE_EVENT(cxl_port_aer_correctable_error,
+ TP_PROTO(struct device *dev, u32 status),
+ TP_ARGS(dev, status),
+ TP_STRUCT__entry(
+ __string(devname, dev_name(dev))
+ __string(host, dev_name(dev->parent))
+ __field(u32, status)
+ ),
+ TP_fast_assign(
+ __assign_str(devname);
+ __assign_str(host);
+ __entry->status = status;
+ ),
+ TP_printk("device=%s host=%s status='%s'",
+ __get_str(devname), __get_str(host),
+ show_ce_errs(__entry->status)
+ )
+);
+
TRACE_EVENT(cxl_aer_correctable_error,
TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
TP_ARGS(cxlmd, status),
--
2.34.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 13/15] cxl/pci: Add trace logging for CXL PCIe port RAS errors
2024-10-08 22:16 ` [PATCH 13/15] cxl/pci: Add trace logging " Terry Bowman
@ 2024-10-17 14:04 ` Jonathan Cameron
0 siblings, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-17 14:04 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa, shiju.jose
On Tue, 8 Oct 2024 17:16:55 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL drivers use kernel trace functions for logging endpoint and
> RCH downstream port RAS errors. However, similar functionality is
> required for CXL root ports, CXL downstream switch ports, and CXL
> upstream switch ports.
>
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe ports. Additionally, update
> the PCIe port error handlers to invoke these new trace functions.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
+CC Shiju Jose
Shiju,
Could you check the tracepoints against rasdaemon etc.?
Terry,
Just a patch ordering question from me.
> ---
> drivers/cxl/core/pci.c | 16 ++++++++++----
> drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 7e3770f7a955..4706113d2582 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -697,10 +697,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>
> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> + return;
Add a blank line here to make that early exit easier to spot.
> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> + if (is_cxl_memdev(dev))
> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
So after previous patch, this code will be called for ports without this
check in here? Perhaps the two patches need to be swapped in order?
However, we already know the type of device at the callers. Maybe just
pass that in here.
> - }
> + else if (dev_is_pci(dev))
> + trace_cxl_port_aer_correctable_error(dev, status);
> }
>
> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -756,7 +760,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + if (is_cxl_memdev(dev))
> + trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + else if (dev_is_pci(dev))
> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> +
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 9167cfba7f59..6305c0eea627 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,6 +48,34 @@
> { CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" } \
> )
>
> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> + TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> + TP_ARGS(dev, status, fe, hl),
> + TP_STRUCT__entry(
> + __string(devname, dev_name(dev))
> + __string(host, dev_name(dev->parent))
> + __field(u32, status)
> + __field(u32, first_error)
> + __array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> + ),
> + TP_fast_assign(
> + __assign_str(devname);
> + __assign_str(host);
> + __entry->status = status;
> + __entry->first_error = fe;
> + /*
> + * Embed the 512B headerlog data for user app retrieval and
> + * parsing, but no need to print this in the trace buffer.
> + */
> + memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> + ),
> + TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> + __get_str(devname), __get_str(host),
> + show_uc_errs(__entry->status),
> + show_uc_errs(__entry->first_error)
> + )
> +);
> +
> TRACE_EVENT(cxl_aer_uncorrectable_error,
> TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
> TP_ARGS(cxlmd, status, fe, hl),
> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> { CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" } \
> )
>
> +TRACE_EVENT(cxl_port_aer_correctable_error,
> + TP_PROTO(struct device *dev, u32 status),
> + TP_ARGS(dev, status),
> + TP_STRUCT__entry(
> + __string(devname, dev_name(dev))
> + __string(host, dev_name(dev->parent))
> + __field(u32, status)
> + ),
> + TP_fast_assign(
> + __assign_str(devname);
> + __assign_str(host);
> + __entry->status = status;
> + ),
> + TP_printk("device=%s host=%s status='%s'",
> + __get_str(devname), __get_str(host),
> + show_ce_errs(__entry->status)
> + )
> +);
> +
> TRACE_EVENT(cxl_aer_correctable_error,
> TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
> TP_ARGS(cxlmd, status),
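The suggestion above, passing the device type in from the caller instead of
probing it with is_cxl_memdev()/dev_is_pci(), could look roughly like the
following userspace sketch. It also models the early return and the
write-back of the status bits. All mock_* names are hypothetical, the mask
width is an assumption for illustration, and the register is modelled as
write-1-to-clear, which is an assumption about the hardware semantics
rather than something stated in the patch.

```c
#include <stdint.h>

#define MOCK_COR_STATUS_MASK 0x7fu	/* assumed width, illustration only */

/* The caller states what kind of device raised the error. */
enum mock_ras_source { MOCK_SRC_MEMDEV, MOCK_SRC_PORT };

/* Which trace event the handler would emit. */
enum mock_trace { MOCK_TRACE_NONE, MOCK_TRACE_MEMDEV, MOCK_TRACE_PORT };

static enum mock_trace mock_handle_cor_ras(enum mock_ras_source src,
					   uint32_t *status_reg)
{
	uint32_t status = *status_reg;

	if (!(status & MOCK_COR_STATUS_MASK))
		return MOCK_TRACE_NONE;

	/* Model write-1-to-clear: writing the set bits back clears them. */
	*status_reg &= ~(status & MOCK_COR_STATUS_MASK);

	return src == MOCK_SRC_MEMDEV ? MOCK_TRACE_MEMDEV : MOCK_TRACE_PORT;
}
```

Because the source is an explicit argument, the handler has no device type
left unhandled, unlike an if/else-if chain that silently skips tracing when
neither test matches.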
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors()
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (12 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 13/15] cxl/pci: Add trace logging " Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 17:22 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices Terry Bowman
` (4 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The CXL driver needs to enable AER correctable and uncorrectable internal
errors in order to receive notification of protocol
errors. pci_aer_unmask_internal_errors() is currently defined as
'static' in the AER service driver.
Update the AER service driver, exporting pci_aer_unmask_internal_errors()
to CXL namespace.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 6 ++++--
include/linux/aer.h | 2 ++
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 81a19028c4e7..1b4004932084 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -962,7 +962,6 @@ static bool is_internal_error(struct aer_err_info *info)
return info->status & PCI_ERR_UNC_INTN;
}
-#ifdef CONFIG_PCIEAER_CXL
/**
* pci_aer_unmask_internal_errors - unmask internal errors
* @dev: pointer to the pcie_dev data structure
@@ -973,7 +972,7 @@ static bool is_internal_error(struct aer_err_info *info)
* Note: AER must be enabled and supported by the device which must be
* checked in advance, e.g. with pcie_aer_is_native().
*/
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
{
int aer = dev->aer_cap;
u32 mask;
@@ -986,6 +985,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
mask &= ~PCI_ERR_COR_INTERNAL;
pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
}
+EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
+
+#ifdef CONFIG_PCIEAER_CXL
static bool is_cxl_mem_dev(struct pci_dev *dev)
{
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 67fd04c5ae2b..c43d2b60b992 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -69,5 +69,7 @@ struct cxl_port_err_hndlrs {
void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs);
void unregister_cxl_port_hndlrs(void);
struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void);
+
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
#endif //_AER_H_
--
2.34.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors()
2024-10-08 22:16 ` [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors() Terry Bowman
@ 2024-10-16 17:22 ` Jonathan Cameron
0 siblings, 0 replies; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 17:22 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:56 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL driver needs to enable AER correctable and uncorrectable internal
> errors in order to receive notification of protocol
> errors. pci_aer_unmask_internal_errors() is currently defined as
> 'static' in the AER service driver.
>
> Update the AER service driver, exporting pci_aer_unmask_internal_errors()
> to CXL namespace.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
(subject to suggestion we just enable internal errors for all devices
and sit back and watch for error reports :))
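For context, unmasking here is a read-modify-write that clears one bit in
each AER mask register and leaves the other mask bits alone. A minimal
userspace sketch, assuming the internal-error bit positions below match
PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in the kernel's pci_regs.h; the
mock_* names and the struct standing in for config space are hypothetical.

```c
#include <stdint.h>

/* Assumed to mirror PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL. */
#define MOCK_ERR_UNC_INTN	0x00400000u
#define MOCK_ERR_COR_INTERNAL	0x00004000u

/* Stand-in for the two AER mask registers in config space. */
struct mock_aer_masks {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/* Clear only the internal-error mask bits; a set bit means "masked". */
static void mock_unmask_internal_errors(struct mock_aer_masks *m)
{
	m->uncor_mask &= ~MOCK_ERR_UNC_INTN;
	m->cor_mask &= ~MOCK_ERR_COR_INTERNAL;
}
```

Starting from fully masked registers (0xffffffff), this leaves 0xffbfffff
in the uncorrectable mask and 0xffffbfff in the correctable mask.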
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (13 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors() Terry Bowman
@ 2024-10-08 22:16 ` Terry Bowman
2024-10-16 17:21 ` Jonathan Cameron
2024-10-10 19:07 ` [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
` (3 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-08 22:16 UTC (permalink / raw)
To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa,
terry.bowman
The AER service driver and the CXL drivers have been updated to handle
PCIe port protocol errors. However, the PCIe AER correctable and
uncorrectable internal errors are still masked for the PCIe port devices.
Enable the AER internal errors for CXL PCIe port devices.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 4706113d2582..1d84a7022c4d 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -908,6 +908,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_port_err_detected, CXL);
void cxl_uport_init_aer(struct cxl_port *port)
{
+ struct pci_dev *pdev = to_pci_dev(port->uport_dev);
/* uport may have more than 1 downstream EP. Check if already mapped. */
if (port->uport_regs.ras) {
dev_warn(&port->dev, "RAS is already mapped\n");
@@ -920,12 +921,14 @@ void cxl_uport_init_aer(struct cxl_port *port)
dev_err(&port->dev, "Failed to map RAS capability.\n");
return;
}
+ pci_aer_unmask_internal_errors(pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_uport_init_aer, CXL);
void cxl_dport_init_aer(struct cxl_dport *dport)
{
struct device *dport_dev = dport->dport_dev;
+ struct pci_dev *pdev = to_pci_dev(dport_dev);
if (dport->rch) {
struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
@@ -949,6 +952,7 @@ void cxl_dport_init_aer(struct cxl_dport *dport)
dev_err(dport_dev, "Failed to map RAS capability.\n");
return;
}
+ pci_aer_unmask_internal_errors(pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_dport_init_aer, CXL);
--
2.34.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
2024-10-08 22:16 ` [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices Terry Bowman
@ 2024-10-16 17:21 ` Jonathan Cameron
2024-10-16 17:24 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Jonathan Cameron @ 2024-10-16 17:21 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Tue, 8 Oct 2024 17:16:57 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service drivers and CXL drivers are updated to handle PCIe
> port protocol errors. But, the PCIe AER correctable and uncorrectable
> internal errors are mask disabled for the PCIe port devices.
>
> Enable the AER internal errors for CXL PCIe port devices.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
A while back I thought we had a discussion about just enabling these
for all devices and seeing if anyone screamed?
I'd love to do that rather than carefully enabling them for CXL devices
only ;)
If not, this looks fine to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> drivers/cxl/core/pci.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 4706113d2582..1d84a7022c4d 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -908,6 +908,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_port_err_detected, CXL);
>
> void cxl_uport_init_aer(struct cxl_port *port)
> {
> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> /* uport may have more than 1 downstream EP. Check if already mapped. */
> if (port->uport_regs.ras) {
> dev_warn(&port->dev, "RAS is already mapped\n");
> @@ -920,12 +921,14 @@ void cxl_uport_init_aer(struct cxl_port *port)
> dev_err(&port->dev, "Failed to map RAS capability.\n");
> return;
> }
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_aer, CXL);
>
> void cxl_dport_init_aer(struct cxl_dport *dport)
> {
> struct device *dport_dev = dport->dport_dev;
> + struct pci_dev *pdev = to_pci_dev(dport_dev);
>
> if (dport->rch) {
> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> @@ -949,6 +952,7 @@ void cxl_dport_init_aer(struct cxl_dport *dport)
> dev_err(dport_dev, "Failed to map RAS capability.\n");
> return;
> }
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_aer, CXL);
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
2024-10-16 17:21 ` Jonathan Cameron
@ 2024-10-16 17:24 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-16 17:24 UTC (permalink / raw)
To: Jonathan Cameron
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
Hi Jonathan,
On 10/16/24 12:21, Jonathan Cameron wrote:
> On Tue, 8 Oct 2024 17:16:57 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service drivers and CXL drivers are updated to handle PCIe
>> port protocol errors. However, the PCIe AER correctable and uncorrectable
>> internal errors are masked for the PCIe port devices.
>>
>> Enable the AER internal errors for CXL PCIe port devices.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> A while back I thought we had a discussion about just enabling these
> for all devices and seeing if anyone screamed?
>
> I'd love to do that rather than carefully enabling them for CXL devices
> only ;)
>
> If not, this looks fine to me.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
These last 2 patches will be removed in v2 because they are not necessary.
Internal AER errors for root ports and RCECs are already enabled
by the AER driver.
Regards,
Terry
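
For readers following the status/mask values quoted in the test logs of this
thread (for example "error status/mask=00400000/02000000" next to
"[22] UncorrIntErr", and "00004000/0000a000" next to "[14] CorrIntErr"), the
following is a minimal userspace sketch of how a status bit relates to its
mask bit. It is not kernel code and not part of the patchset. The helper name
aer_bit_reported() and the two macro names are hypothetical and exist only
for illustration; the bit positions are taken from the log lines themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as printed in the logs: "[22] UncorrIntErr", "[14] CorrIntErr". */
#define UNCOR_INTERNAL_BIT 22u
#define COR_INTERNAL_BIT   14u

/*
 * Hypothetical helper: return true when 'bit' is set in the status word
 * and clear in the mask word, i.e. the error is flagged and not masked.
 */
bool aer_bit_reported(uint32_t status, uint32_t mask, unsigned int bit)
{
	uint32_t flag = UINT32_C(1) << bit;

	return (status & flag) != 0 && (mask & flag) == 0;
}
```

With the values from the logs, aer_bit_reported(0x00400000, 0x02000000,
UNCOR_INTERNAL_BIT) and aer_bit_reported(0x00004000, 0x0000a000,
COR_INTERNAL_BIT) both return true, since neither mask word has the
internal error bit set.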
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (14 preceding siblings ...)
2024-10-08 22:16 ` [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices Terry Bowman
@ 2024-10-10 19:07 ` Bjorn Helgaas
2024-10-14 17:22 ` Terry Bowman
2024-10-17 16:34 ` Fan Ni
` (2 subsequent siblings)
18 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2024-10-10 19:07 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/
Makes life easier if URLs are all on one line so they still work.
> Testing:
>
> Below are test results for this patchset. This is using QEMU with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
> [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 27.335716] Call Trace:
> [ 27.335985] <TASK>
> [ 27.336226] panic+0x2ed/0x320
> [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
> [ 27.337453] cxl_do_recovery+0x304/0x310
> [ 27.337833] aer_isr+0x3fd/0x700
> [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
> [ 27.338572] irq_thread_fn+0x1f/0x60
> [ 27.338923] irq_thread+0x102/0x1b0
> [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 27.339683] ? __pfx_irq_thread+0x10/0x10
> [ 27.340059] kthread+0xcd/0x100
> [ 27.340387] ? __pfx_kthread+0x10/0x10
> [ 27.340748] ret_from_fork+0x2f/0x50
> [ 27.341100] ? __pfx_kthread+0x10/0x10
> [ 27.341466] ret_from_fork_asm+0x1a/0x30
> [ 27.341842] </TASK>
> [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Root port CE:
> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
> [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
> [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 19.449223]
> [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
> Upstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
> [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
> [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
> [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 45.242448]
> [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
> [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 45.251907] Call Trace:
> [ 45.253284] <TASK>
> [ 45.253564] panic+0x2ed/0x320
> [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
> [ 45.255915] cxl_do_recovery+0x304/0x310
> [ 45.257219] aer_isr+0x3fd/0x700
> [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
> [ 45.258006] irq_thread_fn+0x1f/0x60
> [ 45.258383] irq_thread+0x102/0x1b0
> [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 45.259196] ? __pfx_irq_thread+0x10/0x10
> [ 45.259605] kthread+0xcd/0x100
> [ 45.259956] ? __pfx_kthread+0x10/0x10
> [ 45.260386] ret_from_fork+0x2f/0x50
> [ 45.260879] ? __pfx_kthread+0x10/0x10
> [ 45.261418] ret_from_fork_asm+0x1a/0x30
> [ 45.261936] </TASK>
> [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Upstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
> [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
> [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 37.510180]
> [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
> Downstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
> [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
> [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 29.427111]
> [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
> [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 29.433031] Call Trace:
> [ 29.433354] <TASK>
> [ 29.433631] panic+0x2ed/0x320
> [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
> [ 29.435179] cxl_do_recovery+0x304/0x310
> [ 29.435626] aer_isr+0x3fd/0x700
> [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
> [ 29.436507] irq_thread_fn+0x1f/0x60
> [ 29.436898] irq_thread+0x102/0x1b0
> [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 29.437758] ? __pfx_irq_thread+0x10/0x10
> [ 29.438189] kthread+0xcd/0x100
> [ 29.438551] ? __pfx_kthread+0x10/0x10
> [ 29.438959] ret_from_fork+0x2f/0x50
> [ 29.439362] ? __pfx_kthread+0x10/0x10
> [ 29.439771] ret_from_fork_asm+0x1a/0x30
> [ 29.440221] </TASK>
> [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Downstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 177.119521]
> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
Thanks for the hints about how to test this; it's helpful to have
those in the email archives. Remove the timestamps and non-relevant
call trace entries unless they add useful information. AFAICT they're
just distractions in this case.
> Changes RFC->v1:
> [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
> [Dan] Add cxl_do_recovery()
> [Jonathan] Flatten cxl_setup_parent_uport()
> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
> [Jonathan] Rename cxl_dev_is_pci_type()
> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
> replace these find_cxl_port() and device_find_child().
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
> [Ming] Don't use endpoint as host to cxl_map_component_regs()
> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
> [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (15):
> cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
> driver
> cxl/aer/pci: Update is_internal_error() to be callable w/o
> CONFIG_PCIEAER_CXL
> cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
> PCIe ports
> cxl/aer/pci: Add CXL PCIe port correctable error support in AER
> service driver
> cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
> PCIe port devices
> cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
> cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
> service driver
I had to look at the patches to learn that all the above only touch
drivers/pci, aer.h, and pci.h. Can you use the PCI subject line
conventions (e.g., "PCI/AER: ...") to make this more obvious? Almost
all already include "CXL", so I don't think we'd really lose any
information.
> cxl/pci: Change find_cxl_ports() to be non-static
> cxl/pci: Map CXL PCIe downstream port RAS registers
> cxl/pci: Map CXL PCIe upstream port RAS registers
> cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
> cxl/pci: Add error handler for CXL PCIe port RAS errors
> cxl/pci: Add trace logging for CXL PCIe port RAS errors
> cxl/aer/pci: Export pci_aer_unmask_internal_errors()
Ditto here, and add something about CXL in the subject since this
doesn't export universally.
> cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>
> drivers/cxl/core/core.h | 3 +
> drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++--------
> drivers/cxl/core/port.c | 4 +-
> drivers/cxl/core/trace.h | 47 +++++++++++
> drivers/cxl/cxl.h | 14 +++-
> drivers/cxl/mem.c | 30 ++++++-
> drivers/cxl/pci.c | 8 ++
> drivers/pci/pci.h | 5 ++
> drivers/pci/pcie/aer.c | 123 ++++++++++++++++++++--------
> drivers/pci/pcie/err.c | 150 ++++++++++++++++++++++++++++++++++
> include/linux/aer.h | 16 ++++
> include/linux/pci.h | 3 +
> 12 files changed, 503 insertions(+), 72 deletions(-)
>
>
> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
This doesn't apply cleanly on v6.12-rc1, and
f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet. Where
is it? I guess it relies on some other series that hasn't been merged
yet?
Bjorn
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-10 19:07 ` [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
@ 2024-10-14 17:22 ` Terry Bowman
2024-10-14 17:29 ` Bjorn Helgaas
0 siblings, 1 reply; 62+ messages in thread
From: Terry Bowman @ 2024-10-14 17:22 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Hi Bjorn,
Thanks for taking the time to review. I added comments below.
On 10/10/24 14:07, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
>> -1-terry.bowman@amd.com/
>
> Makes life easier if URLs are all on one line so they still work.
>
Ok.
>> Testing:
>>
>> Below are test results for this patchset. This is using QEMU with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>>
>> Root port UCE:
>> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
>> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
>> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 27.325584]
>> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
>> [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 27.335716] Call Trace:
>> [ 27.335985] <TASK>
>> [ 27.336226] panic+0x2ed/0x320
>> [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
>> [ 27.337453] cxl_do_recovery+0x304/0x310
>> [ 27.337833] aer_isr+0x3fd/0x700
>> [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 27.338572] irq_thread_fn+0x1f/0x60
>> [ 27.338923] irq_thread+0x102/0x1b0
>> [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 27.339683] ? __pfx_irq_thread+0x10/0x10
>> [ 27.340059] kthread+0xcd/0x100
>> [ 27.340387] ? __pfx_kthread+0x10/0x10
>> [ 27.340748] ret_from_fork+0x2f/0x50
>> [ 27.341100] ? __pfx_kthread+0x10/0x10
>> [ 27.341466] ret_from_fork_asm+0x1a/0x30
>> [ 27.341842] </TASK>
>> [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>> Root port CE:
>> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>> [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>> [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>> [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
>> [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
>> [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 19.449223]
>> [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>>
>> Upstream switch port UCE:
>> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>> [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>> [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>> [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
>> [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
>> [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 45.242448]
>> [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
>> [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 45.251907] Call Trace:
>> [ 45.253284] <TASK>
>> [ 45.253564] panic+0x2ed/0x320
>> [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
>> [ 45.255915] cxl_do_recovery+0x304/0x310
>> [ 45.257219] aer_isr+0x3fd/0x700
>> [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 45.258006] irq_thread_fn+0x1f/0x60
>> [ 45.258383] irq_thread+0x102/0x1b0
>> [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 45.259196] ? __pfx_irq_thread+0x10/0x10
>> [ 45.259605] kthread+0xcd/0x100
>> [ 45.259956] ? __pfx_kthread+0x10/0x10
>> [ 45.260386] ret_from_fork+0x2f/0x50
>> [ 45.260879] ? __pfx_kthread+0x10/0x10
>> [ 45.261418] ret_from_fork_asm+0x1a/0x30
>> [ 45.261936] </TASK>
>> [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>> Upstream switch port CE:
>> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>> [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>> [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>> [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
>> [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
>> [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 37.510180]
>> [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>>
>> Downstream switch port UCE:
>> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>> [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>> [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>> [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
>> [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
>> [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 29.427111]
>> [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
>> [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 29.433031] Call Trace:
>> [ 29.433354] <TASK>
>> [ 29.433631] panic+0x2ed/0x320
>> [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
>> [ 29.435179] cxl_do_recovery+0x304/0x310
>> [ 29.435626] aer_isr+0x3fd/0x700
>> [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 29.436507] irq_thread_fn+0x1f/0x60
>> [ 29.436898] irq_thread+0x102/0x1b0
>> [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 29.437758] ? __pfx_irq_thread+0x10/0x10
>> [ 29.438189] kthread+0xcd/0x100
>> [ 29.438551] ? __pfx_kthread+0x10/0x10
>> [ 29.438959] ret_from_fork+0x2f/0x50
>> [ 29.439362] ? __pfx_kthread+0x10/0x10
>> [ 29.439771] ret_from_fork_asm+0x1a/0x30
>> [ 29.440221] </TASK>
>> [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>> Downstream switch port CE:
>> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
>> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
>> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 177.119521]
>> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
> Thanks for the hints about how to test this; it's helpful to have
> those in the email archives. Remove the timestamps and non-relevant
> call trace entries unless they add useful information. AFAICT they're
> just distractions in this case.
>
I'll remove the test logging and details from the cover sheet. I'm unable to find how to
attach files using git tools. Instead of an attachment, I can host the log files and details
on a public GitHub repository. Let me know if this is not acceptable.
>> Changes RFC->v1:
>> [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
>> [Dan] Add cxl_do_recovery()
>> [Jonathan] Flatten cxl_setup_parent_uport()
>> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>> [Jonathan] Rename cxl_dev_is_pci_type()
>> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>> replace these find_cxl_port() and device_find_child().
>> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>> [Ming] Don't use endpoint as host to cxl_map_component_regs()
>> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>> [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>>
>> Terry Bowman (15):
>> cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
>> driver
>> cxl/aer/pci: Update is_internal_error() to be callable w/o
>> CONFIG_PCIEAER_CXL
>> cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
>> PCIe ports
>> cxl/aer/pci: Add CXL PCIe port correctable error support in AER
>> service driver
>> cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
>> PCIe port devices
>> cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
>> cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
>> service driver
>
> I had to look at the patches to learn that all the above only touch
> drivers/pci, aer.h, and pci.h. Can you use the PCI subject line
> conventions (e.g., "PCI/AER: ...") to make this more obvious? Almost
> all already include "CXL", so I don't think we'd really lose any
> information.
>
Yes, I'll change the patches' headlines to use capitalized "PCI/AER".
>> cxl/pci: Change find_cxl_ports() to be non-static
>> cxl/pci: Map CXL PCIe downstream port RAS registers
>> cxl/pci: Map CXL PCIe upstream port RAS registers
>> cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
>> cxl/pci: Add error handler for CXL PCIe port RAS errors
>> cxl/pci: Add trace logging for CXL PCIe port RAS errors
>> cxl/aer/pci: Export pci_aer_unmask_internal_errors()
>
> Ditto here, and add something about CXL in the subject since this
> doesn't export universally.
>
Ok.
>> cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>>
>> drivers/cxl/core/core.h | 3 +
>> drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++--------
>> drivers/cxl/core/port.c | 4 +-
>> drivers/cxl/core/trace.h | 47 +++++++++++
>> drivers/cxl/cxl.h | 14 +++-
>> drivers/cxl/mem.c | 30 ++++++-
>> drivers/cxl/pci.c | 8 ++
>> drivers/pci/pci.h | 5 ++
>> drivers/pci/pcie/aer.c | 123 ++++++++++++++++++++--------
>> drivers/pci/pcie/err.c | 150 ++++++++++++++++++++++++++++++++++
>> include/linux/aer.h | 16 ++++
>> include/linux/pci.h | 3 +
>> 12 files changed, 503 insertions(+), 72 deletions(-)
>>
>>
>> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
>
> This doesn't apply cleanly on v6.12-rc1, and
> f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet. Where
> is it? I guess it relies on some other series that hasn't been merged
> yet?
>
> Bjorn
Hmmm, I thought I was using a 6.11-rc7 commit. I will rebase to either 6.12-rc1 or rc2.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-14 17:22 ` Terry Bowman
@ 2024-10-14 17:29 ` Bjorn Helgaas
2024-10-14 17:33 ` Terry Bowman
0 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2024-10-14 17:29 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
> On 10/10/24 14:07, Bjorn Helgaas wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> >> port handling and logging.
> ...
> >> Downstream switch port CE:
> >> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> >> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> >> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> >> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> >> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> >> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
> >> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> >> [ 177.119521]
> >> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
> >
> > Thanks for the hints about how to test this; it's helpful to have
> > those in the email archives. Remove the timestamps and non-relevant
> > call trace entries unless they add useful information. AFAICT they're
> > just distractions in this case.
>
> I'll remove the test logging and details from the cover sheet. I'm
> unable to find how to attach using git tools. Instead of an
> attachment, I can locate the log files and details on a public
> github. Let me know if this is not acceptable.
It's fine to keep this in the cover sheet, and I'd rather have it
there, where lore will archive it reliably forever, than to have a
pointer to some other github that may eventually disappear even though
it's public today.
I just meant to remove irrelevant information like the timestamps.
Bjorn
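
For anyone cleaning up a log the same way, here is a minimal sketch of
stripping the leading dmesg timestamp from each line. The helper name
skip_dmesg_timestamp() is hypothetical and not part of aer-inject or the
kernel; it only assumes the "[ seconds.micros] " prefix seen in the
output quoted above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a pointer just past a leading
 * "[  177.114442] " style dmesg timestamp, or the unmodified line
 * when no such prefix is present.
 */
static const char *skip_dmesg_timestamp(const char *line)
{
	const char *p = line;

	assert(line != NULL);

	if (*p != '[')
		return line;
	p++;
	while (*p == ' ')		/* dmesg pads the seconds field */
		p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))
		p++;
	if (*p != '.')
		return line;
	p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))
		p++;
	if (*p != ']')
		return line;
	p++;
	if (*p == ' ')			/* drop the single separator space */
		p++;
	return p;
}
```

Lines such as "[14] CorrIntErr" that start with a bracket but are not
timestamps are returned unchanged, because the check requires digits, a
dot, and more digits inside the brackets.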
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-14 17:29 ` Bjorn Helgaas
@ 2024-10-14 17:33 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-14 17:33 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
On 10/14/24 12:29, Bjorn Helgaas wrote:
> On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
>> On 10/10/24 14:07, Bjorn Helgaas wrote:
>>> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>>>> port handling and logging.
>> ...
>
>>>> Downstream switch port CE:
>>>> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>>>> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>>>> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>>>> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>>> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
>>>> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
>>>> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>>> [ 177.119521]
>>>> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>>>
>>> Thanks for the hints about how to test this; it's helpful to have
>>> those in the email archives. Remove the timestamps and non-relevant
>>> call trace entries unless they add useful information. AFAICT they're
>>> just distractions in this case.
>>
>> I'll remove the test logging and details from the cover sheet. I'm
>> unable to find how to attach using git tools. Instead of an
>> attachment, I can locate the log files and details on a public
>> github. Let me know if this is not acceptable.
>
> It's fine to keep this in the cover sheet, and I'd rather have it
> there, where lore will archive it reliably forever, than to have a
> pointer to some other github that may eventually disappear even though
> it's public today.
>
> I just meant to remove irrelevant information like the timestamps.
>
> Bjorn
OK, I'll clean it up and leave it here. Thanks.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (15 preceding siblings ...)
2024-10-10 19:07 ` [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
@ 2024-10-17 16:34 ` Fan Ni
2024-10-17 17:27 ` Bowman, Terry
2024-10-18 23:22 ` Bjorn Helgaas
2024-10-22 1:43 ` Dan Williams
18 siblings, 1 reply; 62+ messages in thread
From: Fan Ni @ 2024-10-17 16:34 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
Hi Terry,
Can you share the aer-inject repo for the testing or the test patch?
Fan
>
> Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
> [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 27.335716] Call Trace:
> [ 27.335985] <TASK>
> [ 27.336226] panic+0x2ed/0x320
> [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
> [ 27.337453] cxl_do_recovery+0x304/0x310
> [ 27.337833] aer_isr+0x3fd/0x700
> [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
> [ 27.338572] irq_thread_fn+0x1f/0x60
> [ 27.338923] irq_thread+0x102/0x1b0
> [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 27.339683] ? __pfx_irq_thread+0x10/0x10
> [ 27.340059] kthread+0xcd/0x100
> [ 27.340387] ? __pfx_kthread+0x10/0x10
> [ 27.340748] ret_from_fork+0x2f/0x50
> [ 27.341100] ? __pfx_kthread+0x10/0x10
> [ 27.341466] ret_from_fork_asm+0x1a/0x30
> [ 27.341842] </TASK>
> [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Root port CE:
> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
> [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
> [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 19.449223]
> [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
> Upstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
> [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
> [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
> [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 45.242448]
> [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
> [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 45.251907] Call Trace:
> [ 45.253284] <TASK>
> [ 45.253564] panic+0x2ed/0x320
> [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
> [ 45.255915] cxl_do_recovery+0x304/0x310
> [ 45.257219] aer_isr+0x3fd/0x700
> [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
> [ 45.258006] irq_thread_fn+0x1f/0x60
> [ 45.258383] irq_thread+0x102/0x1b0
> [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 45.259196] ? __pfx_irq_thread+0x10/0x10
> [ 45.259605] kthread+0xcd/0x100
> [ 45.259956] ? __pfx_kthread+0x10/0x10
> [ 45.260386] ret_from_fork+0x2f/0x50
> [ 45.260879] ? __pfx_kthread+0x10/0x10
> [ 45.261418] ret_from_fork_asm+0x1a/0x30
> [ 45.261936] </TASK>
> [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Upstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
> [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
> [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 37.510180]
> [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
> Downstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
> [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
> [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 29.427111]
> [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
> [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 29.433031] Call Trace:
> [ 29.433354] <TASK>
> [ 29.433631] panic+0x2ed/0x320
> [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
> [ 29.435179] cxl_do_recovery+0x304/0x310
> [ 29.435626] aer_isr+0x3fd/0x700
> [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
> [ 29.436507] irq_thread_fn+0x1f/0x60
> [ 29.436898] irq_thread+0x102/0x1b0
> [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 29.437758] ? __pfx_irq_thread+0x10/0x10
> [ 29.438189] kthread+0xcd/0x100
> [ 29.438551] ? __pfx_kthread+0x10/0x10
> [ 29.438959] ret_from_fork+0x2f/0x50
> [ 29.439362] ? __pfx_kthread+0x10/0x10
> [ 29.439771] ret_from_fork_asm+0x1a/0x30
> [ 29.440221] </TASK>
> [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Downstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 177.119521]
> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
> Changes RFC->v1:
> [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
> [Dan] Add cxl_do_recovery()
> [Jonathan] Flatten cxl_setup_parent_uport()
> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
> [Jonathan] Rename cxl_dev_is_pci_type()
> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
> replace these find_cxl_port() and device_find_child().
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
> [Ming] Don't use endpoint as host to cxl_map_component_regs()
> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
> [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (15):
> cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
> driver
> cxl/aer/pci: Update is_internal_error() to be callable w/o
> CONFIG_PCIEAER_CXL
> cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
> PCIe ports
> cxl/aer/pci: Add CXL PCIe port correctable error support in AER
> service driver
> cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
> PCIe port devices
> cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
> cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
> service driver
> cxl/pci: Change find_cxl_ports() to be non-static
> cxl/pci: Map CXL PCIe downstream port RAS registers
> cxl/pci: Map CXL PCIe upstream port RAS registers
> cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
> cxl/pci: Add error handler for CXL PCIe port RAS errors
> cxl/pci: Add trace logging for CXL PCIe port RAS errors
> cxl/aer/pci: Export pci_aer_unmask_internal_errors()
> cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>
> drivers/cxl/core/core.h | 3 +
> drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++--------
> drivers/cxl/core/port.c | 4 +-
> drivers/cxl/core/trace.h | 47 +++++++++++
> drivers/cxl/cxl.h | 14 +++-
> drivers/cxl/mem.c | 30 ++++++-
> drivers/cxl/pci.c | 8 ++
> drivers/pci/pci.h | 5 ++
> drivers/pci/pcie/aer.c | 123 ++++++++++++++++++++--------
> drivers/pci/pcie/err.c | 150 ++++++++++++++++++++++++++++++++++
> include/linux/aer.h | 16 ++++
> include/linux/pci.h | 3 +
> 12 files changed, 503 insertions(+), 72 deletions(-)
>
>
> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
> --
> 2.34.1
>
--
Fan Ni
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-17 16:34 ` Fan Ni
@ 2024-10-17 17:27 ` Bowman, Terry
2024-10-21 22:19 ` Fan Ni
0 siblings, 1 reply; 62+ messages in thread
From: Bowman, Terry @ 2024-10-17 17:27 UTC (permalink / raw)
To: Fan Ni, Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
[-- Attachment #1: Type: text/plain, Size: 1700 bytes --]
Hi Fan,
On 10/17/2024 11:34 AM, Fan Ni wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
>> -1-terry.bowman@amd.com/
>>
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Hi Terry,
> Can you share the aer-inject repo for the testing or the test patch?
>
> Fan
Sure, but it's easiest to attach the patch here.
Origin was https://github.com/jderrick/aer-inject.git
Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b
Regards,
Terry
[-- Attachment #2: 0001-aer-inject-Add-internal-error-injection.patch --]
[-- Type: text/plain, Size: 2994 bytes --]
From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
From: Terry Bowman <terry.bowman@amd.com>
Date: Thu, 17 Oct 2024 12:12:58 -0500
Subject: [PATCH] aer-inject: Add internal error injection
Add corrected (CE) and uncorrected (UCE) AER internal error injection
support.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
aer.h | 2 ++
aer.lex | 2 ++
aer.y | 8 ++++----
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/aer.h b/aer.h
index a0ad152..e55a731 100644
--- a/aer.h
+++ b/aer.h
@@ -30,11 +30,13 @@ struct aer_error_inj
#define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
#define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
#define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
+#define PCI_ERR_UNC_INTERNAL 0x00400000 /* Internal error */
#define PCI_ERR_COR_RCVR 0x00000001 /* Receiver Error Status */
#define PCI_ERR_COR_BAD_TLP 0x00000040 /* Bad TLP Status */
#define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
#define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
#define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
+#define PCI_ERR_COR_CINTERNAL 0x00004000 /* Internal error */
extern void init_aer(struct aer_error_inj *err);
extern void submit_aer(struct aer_error_inj *err);
diff --git a/aer.lex b/aer.lex
index 6121e4e..4fadd0e 100644
--- a/aer.lex
+++ b/aer.lex
@@ -82,11 +82,13 @@ static struct key {
KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
+ KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
KEYVAL(RCVR, PCI_ERR_COR_RCVR),
KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
+ KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
};
static int cmp_key(const void *av, const void *bv)
diff --git a/aer.y b/aer.y
index e5ecc7d..500dc97 100644
--- a/aer.y
+++ b/aer.y
@@ -34,8 +34,8 @@ static void init(void);
%token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
%token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
-%token <num> MALF_TLP ECRC UNSUP
-%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
+%token <num> MALF_TLP ECRC UNSUP INTERNAL
+%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
%token <num> SYMBOL NUMBER
%token <str> PCI_ID_STR
@@ -77,14 +77,14 @@ uncor_status_list: /* empty */ { $$ = 0; }
;
uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
- | UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
+ | UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
;
cor_status_list: /* empty */ { $$ = 0; }
| cor_status_list cor_status { $$ = $1 | $2; }
;
-cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
+cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
;
%%
--
2.34.1
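
As a cross-check of the two masks the patch adds, the sketch below maps
them to the bit numbers printed in the test logs ("[22] UncorrIntErr" and
"[14] CorrIntErr"). The define names follow the patch; lowest_set_bit()
and internal_error_name() are hypothetical helpers written for this
illustration and exist in neither aer-inject nor the kernel. The argument
order mirrors the "Injecting errors <cor>/<uncor>" lines in the logs.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_UNC_INTERNAL	0x00400000	/* Uncorrectable internal error */
#define PCI_ERR_COR_CINTERNAL	0x00004000	/* Corrected internal error */

/* Return the index of the lowest set bit in a status mask, or -1 if none. */
static int lowest_set_bit(uint32_t status)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/*
 * Hypothetical helper: name the internal error bit the way the AER log
 * prints it. Returns NULL when neither internal error bit is set.
 */
static const char *internal_error_name(uint32_t cor_status,
				       uint32_t uncor_status)
{
	if (uncor_status & PCI_ERR_UNC_INTERNAL)
		return "UncorrIntErr";
	if (cor_status & PCI_ERR_COR_CINTERNAL)
		return "CorrIntErr";
	return NULL;
}
```

With the values from the logs, 00004000/00000000 resolves to bit 14
(CorrIntErr) and 00000000/00400000 resolves to bit 22 (UncorrIntErr).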
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-17 17:27 ` Bowman, Terry
@ 2024-10-21 22:19 ` Fan Ni
0 siblings, 0 replies; 62+ messages in thread
From: Fan Ni @ 2024-10-21 22:19 UTC (permalink / raw)
To: Bowman, Terry
Cc: Fan Ni, Terry Bowman, ming4.li, linux-cxl, linux-kernel,
linux-pci, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot,
smita.koralahallichannabasappa
On Thu, Oct 17, 2024 at 12:27:04PM -0500, Bowman, Terry wrote:
> Hi Fan,
>
> On 10/17/2024 11:34 AM, Fan Ni wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> > > This is a continuation of the CXL port error handling RFC from earlier.[1]
> > > The RFC resulted in the decision to add CXL PCIe port error handling to
> > > the existing RCH downstream port handling. This patchset adds the CXL PCIe
> > > port handling and logging.
> > >
> > > The first 7 patches update the existing AER service driver to support CXL
> > > PCIe port protocol error handling and reporting. This includes AER service
> > > driver changes for adding correctable and uncorrectable error support, CXL
> > > specific recovery handling, and addition of CXL driver callback handlers.
> > >
> > > The following 8 patches address CXL driver support for CXL PCIe port
> > > protocol errors. This includes the following changes to the CXL drivers:
> > > mapping CXL port and downstream port RAS registers, interface updates for
> > > common RCH and VH, adding port specific error handlers, and protocol error
> > > logging.
> > >
> > > [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> > > -1-terry.bowman@amd.com/
> > >
> > > Testing:
> > >
> > > Below are test results for this patchset. This is using Qemu with a root
> > > port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> > > (0e:00.0).
> > >
> > > This was tested using aer-inject updated to support CE and UCE internal
> > > error injection. CXL RAS was set using a test patch (not upstreamed).
> >
> > Hi Terry,
> > Can you share the aer-inject repo for the testing or the test patch?
Hi Terry,
Could you tell me which code base you use for this patch set?
I hit a lot of issues when trying to apply it on top of "fixes" or
"next" branches.
Fan
> >
> > Fan
>
> Sure, but, its easiest to attach the patch here.
>
> Origin was https://github.com/jderrick/aer-inject.git
> Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b
>
> Regards,
> Terry
> From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
> From: Terry Bowman <terry.bowman@amd.com>
> Date: Thu, 17 Oct 2024 12:12:58 -0500
> Subject: [PATCH] aer-inject: Add internal error injection
>
> Add corrected (CE) and uncorrected (UCE) AER internal error injection
> support.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> aer.h | 2 ++
> aer.lex | 2 ++
> aer.y | 8 ++++----
> 3 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/aer.h b/aer.h
> index a0ad152..e55a731 100644
> --- a/aer.h
> +++ b/aer.h
> @@ -30,11 +30,13 @@ struct aer_error_inj
> #define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
> #define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
> #define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
> +#define PCI_ERR_UNC_INTERNAL 0x00400000 /* Internal error */
> #define PCI_ERR_COR_RCVR 0x00000001 /* Receiver Error Status */
> #define PCI_ERR_COR_BAD_TLP 0x00000040 /* Bad TLP Status */
> #define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
> #define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
> #define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
> +#define PCI_ERR_COR_CINTERNAL 0x00004000 /* Internal error */
>
> extern void init_aer(struct aer_error_inj *err);
> extern void submit_aer(struct aer_error_inj *err);
> diff --git a/aer.lex b/aer.lex
> index 6121e4e..4fadd0e 100644
> --- a/aer.lex
> +++ b/aer.lex
> @@ -82,11 +82,13 @@ static struct key {
> KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
> KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
> KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
> + KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
> KEYVAL(RCVR, PCI_ERR_COR_RCVR),
> KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
> KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
> KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
> KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
> + KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
> };
>
> static int cmp_key(const void *av, const void *bv)
> diff --git a/aer.y b/aer.y
> index e5ecc7d..500dc97 100644
> --- a/aer.y
> +++ b/aer.y
> @@ -34,8 +34,8 @@ static void init(void);
>
> %token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
> %token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
> -%token <num> MALF_TLP ECRC UNSUP
> -%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
> +%token <num> MALF_TLP ECRC UNSUP INTERNAL
> +%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
> %token <num> SYMBOL NUMBER
> %token <str> PCI_ID_STR
>
> @@ -77,14 +77,14 @@ uncor_status_list: /* empty */ { $$ = 0; }
> ;
>
> uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
> - | UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
> + | UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
> ;
>
> cor_status_list: /* empty */ { $$ = 0; }
> | cor_status_list cor_status { $$ = $1 | $2; }
> ;
>
> -cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
> +cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
> ;
>
> %%
> --
> 2.34.1
>
--
Fan Ni
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (16 preceding siblings ...)
2024-10-17 16:34 ` Fan Ni
@ 2024-10-18 23:22 ` Bjorn Helgaas
2024-10-21 19:22 ` Terry Bowman
2024-10-22 1:43 ` Dan Williams
18 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2024-10-18 23:22 UTC (permalink / raw)
To: Terry Bowman
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
Looks like all my comments at
https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
apply.
URL broken across lines, distracting timestamps, patch subjects,
no clue about the base commit.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-18 23:22 ` Bjorn Helgaas
@ 2024-10-21 19:22 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-21 19:22 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Hi Bjorn,
I added a response below.
On 10/18/24 18:22, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>
> Looks like all my comments at
> https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
> apply.
>
> URL broken across lines, distracting timestamps, patch subjects,
> no clue about the base commit.
I added changes for code reuse in pcie_do_recovery() as recommended. I am finishing
testing now and will post v2 shortly.
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
` (17 preceding siblings ...)
2024-10-18 23:22 ` Bjorn Helgaas
@ 2024-10-22 1:43 ` Dan Williams
2024-10-22 13:29 ` Terry Bowman
18 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2024-10-22 1:43 UTC (permalink / raw)
To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
rrichter, nathan.fontenot, smita.koralahallichannabasappa
Terry Bowman wrote:
[..]
> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
Thanks for these test outputs!
>
> Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
It strikes me that by this point the code knows that it is a "CXL Bus"
error and no longer a "PCIe Bus" error. Given the divergent responses
to Fatal errors based on bus I think it would help to clarify that the
kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.
> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
...i.e. someone may not notice that this is a "cxl" reference in the
backtrace.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
2024-10-22 1:43 ` Dan Williams
@ 2024-10-22 13:29 ` Terry Bowman
0 siblings, 0 replies; 62+ messages in thread
From: Terry Bowman @ 2024-10-22 13:29 UTC (permalink / raw)
To: Dan Williams, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, smita.koralahallichannabasappa
Hi Dan,
On 10/21/24 20:43, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Thanks for these test outputs!
>
>>
>> Root port UCE:
>> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
>> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
>> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>
> It strikes me that by this point the code knows that it is a "CXL Bus"
> error and no longer a "PCIe Bus" error. Given the divergent responses
> to Fatal errors based on bus I think it would help to clarify that the
> kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.
>
>> [ 27.325584]
>> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>
> ...i.e. someone may not notice that this is a "cxl" reference in the
> backtrace.
Good idea. I'll add logic to print "CXL Bus" instead of "PCIe Bus" when the erroring device is a CXL device.
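As a rough illustration only, the bus-label selection could look something like
the standalone sketch below. This is not the actual patch and does not use the
kernel's AER structures: struct err_dev, its is_cxl flag, err_bus_name() and
format_bus_error() are all hypothetical names invented for this example, and
the message format is simply modeled on the log lines quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the device state an error handler would consult. */
struct err_dev {
	const char *name;	/* device name as logged, e.g. "0000:0c:00.0" */
	bool is_cxl;		/* assumption: true when the erroring device is a CXL port */
};

/* Choose the bus label that appears in the error message. */
static const char *err_bus_name(const struct err_dev *dev)
{
	return dev->is_cxl ? "CXL" : "PCIe";
}

/*
 * Format a message modeled on the "PCIe Bus Error: severity=..." line.
 * Returns the snprintf() result (length the full message needs).
 */
static int format_bus_error(char *buf, size_t len,
			    const struct err_dev *dev, const char *severity)
{
	return snprintf(buf, len, "%s: %s Bus Error: severity=%s",
			dev->name, err_bus_name(dev), severity);
}
```

With is_cxl set, the formatted line reads "CXL Bus Error" for the same device
and severity, so the bus type is visible on the line that precedes a panic
rather than only in the name of the CXL handler that logs afterwards.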
Regards,
Terry
^ permalink raw reply [flat|nested] 62+ messages in thread