* Re: [PATCH 4/9] powerpc/bpf: Handle large branch ranges with BPF_EXIT
From: Naveen N. Rao @ 2021-10-04 18:24 UTC (permalink / raw)
To: Alexei Starovoitov, Christophe Leroy, Daniel Borkmann,
Johan Almbladh, Michael Ellerman, Nicholas Piggin
Cc: bpf, linuxppc-dev
In-Reply-To: <e37766fd-8c52-6961-39a8-2de44a769204@csgroup.eu>
Christophe Leroy wrote:
>
>
> Le 01/10/2021 à 23:14, Naveen N. Rao a écrit :
>> In some scenarios, it is possible that the program epilogue is outside
>> the branch range for a BPF_EXIT instruction. Instead of rejecting such
>> programs, emit an indirect branch. We track the size of the bpf program
>> emitted after the initial run and do a second pass since BPF_EXIT can
>> end up emitting different number of instructions depending on the
>> program size.
>>
>> Suggested-by: Jordan Niethe <jniethe5@gmail.com>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/net/bpf_jit.h | 3 +++
>> arch/powerpc/net/bpf_jit_comp.c | 22 +++++++++++++++++++++-
>> arch/powerpc/net/bpf_jit_comp32.c | 2 +-
>> arch/powerpc/net/bpf_jit_comp64.c | 2 +-
>> 4 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
>> index 89bd744c2bffd4..4023de1698b9f5 100644
>> --- a/arch/powerpc/net/bpf_jit.h
>> +++ b/arch/powerpc/net/bpf_jit.h
>> @@ -126,6 +126,7 @@
>>
>> #define SEEN_FUNC 0x20000000 /* might call external helpers */
>> #define SEEN_TAILCALL 0x40000000 /* uses tail calls */
>> +#define SEEN_BIG_PROG 0x80000000 /* large prog, >32MB */
>>
>> #define SEEN_VREG_MASK 0x1ff80000 /* Volatile registers r3-r12 */
>> #define SEEN_NVREG_MASK 0x0003ffff /* Non volatile registers r14-r31 */
>> @@ -179,6 +180,8 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>> void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
>> void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
>> void bpf_jit_realloc_regs(struct codegen_context *ctx);
>> +int bpf_jit_emit_exit_insn(u32 *image, struct codegen_context *ctx,
>> + int tmp_reg, unsigned long exit_addr);
>>
>> #endif
>>
>> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
>> index fcbf7a917c566e..3204872fbf2738 100644
>> --- a/arch/powerpc/net/bpf_jit_comp.c
>> +++ b/arch/powerpc/net/bpf_jit_comp.c
>> @@ -72,6 +72,21 @@ static int bpf_jit_fixup_subprog_calls(struct bpf_prog *fp, u32 *image,
>> return 0;
>> }
>>
>> +int bpf_jit_emit_exit_insn(u32 *image, struct codegen_context *ctx,
>> + int tmp_reg, unsigned long exit_addr)
>> +{
>> + if (!(ctx->seen & SEEN_BIG_PROG) && is_offset_in_branch_range(exit_addr)) {
>> + PPC_JMP(exit_addr);
>> + } else {
>> + ctx->seen |= SEEN_BIG_PROG;
>> + PPC_FUNC_ADDR(tmp_reg, (unsigned long)image + exit_addr);
>> + EMIT(PPC_RAW_MTCTR(tmp_reg));
>> + EMIT(PPC_RAW_BCTR());
>> + }
>> +
>> + return 0;
>> +}
>> +
>> struct powerpc64_jit_data {
>> struct bpf_binary_header *header;
>> u32 *addrs;
>> @@ -155,12 +170,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>> goto out_addrs;
>> }
>>
>> + if (!is_offset_in_branch_range((long)cgctx.idx * 4))
>> + cgctx.seen |= SEEN_BIG_PROG;
>> +
>> /*
>> * If we have seen a tail call, we need a second pass.
>> * This is because bpf_jit_emit_common_epilogue() is called
>> * from bpf_jit_emit_tail_call() with a not yet stable ctx->seen.
>> + * We also need a second pass if we ended up with too large
>> + * a program so as to fix branches.
>> */
>> - if (cgctx.seen & SEEN_TAILCALL) {
>> + if (cgctx.seen & (SEEN_TAILCALL | SEEN_BIG_PROG)) {
>> cgctx.idx = 0;
>> if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) {
>> fp = org_fp;
>> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
>> index a74d52204f8da2..d2a67574a23066 100644
>> --- a/arch/powerpc/net/bpf_jit_comp32.c
>> +++ b/arch/powerpc/net/bpf_jit_comp32.c
>> @@ -852,7 +852,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>> * we'll just fall through to the epilogue.
>> */
>> if (i != flen - 1)
>> - PPC_JMP(exit_addr);
>> + bpf_jit_emit_exit_insn(image, ctx, tmp_reg, exit_addr);
>
> On ppc32, if you use tmp_reg you must flag it. But I think you could use
> r0 instead.
Indeed. Can we drop tracking of the temp registers and using them while
remapping registers? Are you seeing significant benefits with re-use of
those temp registers?
- Naveen
^ permalink raw reply
* Re: [PATCH v5 00/14] PCI: Add support for Apple M1
From: Linus Walleij @ 2021-10-04 20:42 UTC (permalink / raw)
To: Rob Herring, linuxppc-dev@lists.ozlabs.org list, opensuse-ppc
Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
Mark Kettenis, Lorenzo Pieralisi, Krzysztof Wilczyński,
Marc Zyngier, Joerg Roedel, Sven Peter,
linux-kernel@vger.kernel.org, Joey Gouly, Hector Martin, PCI,
Bjorn Helgaas, Android Kernel Team, Robin Murphy,
Alyssa Rosenzweig, Stan Skowronek
In-Reply-To: <CAL_Jsq+4FF9QYy87aYhJ-AS78qyHp0NkLrL492+WmdyWj-NKaw@mail.gmail.com>
On Mon, Oct 4, 2021 at 9:52 PM Rob Herring <robh+dt@kernel.org> wrote:
> FYI, I pushed patches 1-3 to kernelCI and didn't see any regressions.
> I am a bit worried about changes to the DT interrupt parsing and
> ancient platforms (such as PowerMacs). Most likely there wouldn't be
> any report until -rc1 or months later on those old systems.
Lets page the PPC lists to see if someone can test on some powermac.
Linus Walleij
^ permalink raw reply
* Re: [PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]
From: Shuah Khan @ 2021-10-04 21:22 UTC (permalink / raw)
To: Naveen Naidu, bhelgaas, ruscur, oohall
Cc: linux-pci, linuxppc-dev, linux-kernel-mentees, linux-kernel,
Shuah Khan
In-Reply-To: <b4c5a5005d4549420cf6e86f31a01d3fb2876731.1633357368.git.naveennaidu479@gmail.com>
On 10/4/21 8:29 AM, Naveen Naidu wrote:
> Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
> the AER error logs looked like:
>
> pcieport 0000:00:03.0: AER: Corrected error received: id=0018
> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018 (Receiver ID)
> pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
> pcieport 0000:00:03.0: [ 6] BadTLP
>
> In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
> the "id" field was removed from the AER error logs, so currently AER
> logs look like:
>
> pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
> pcieport 0000:00:03.0: [ 6] BadTLP
>
> The second line in the above logs prints "(Receiver ID)", even when
> there is no "id" in the log line. This is confusing.
>
Starting your commit log to say that message are confusing and then talk
about why will make it easier to understand why the change is needed.
> Remove the "ID" from the aer_agent_string[]. The error logs will
> look as follows (Sample from dummy error injected by aer-inject):
>
> pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
> pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
> pcieport 0000:00:03.0: [ 6] BadTLP
>
It is good to see before and after messages. However, it will be helpful
to know why this change is necessary. It isn't very clear why in this
commit log.
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
Extra signed-off-by?
> ---
> drivers/pci/pcie/aer.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9784fdcf3006..241ff361b43c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
> };
>
> static const char *aer_agent_string[] = {
> - "Receiver ID",
> - "Requester ID",
> - "Completer ID",
> - "Transmitter ID"
> + "Receiver",
> + "Requester",
> + "Completer",
> + "Transmitter"
> };
>
> #define aer_stats_dev_attr(name, stats_array, strings_array, \
> @@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> const char *level;
>
> if (!info->status) {
> - pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> + pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
> aer_error_severity_string[info->severity]);
> goto out;
> }
>
thanks,
-- Shuah
^ permalink raw reply
* Re: Add Apple M1 support to PASemi i2c driver
From: Christian Zigotzky @ 2021-10-04 9:22 UTC (permalink / raw)
To: Sven Peter
Cc: Darren Stevens, Arnd Bergmann, Hector Martin,
Linux Kernel Mailing List, linux-i2c, Paul Mackerras,
Alyssa Rosenzweig, R.T.Dickinson, Olof Johansson,
mohamed.mediouni, Matthew Leaman, Mark Kettenis, linuxppc-dev,
R.T.Dickinson, linux-arm-kernel, Stan Skowronek
In-Reply-To: <49890226-cf04-46ff-bc37-33d1643faea2@www.fastmail.com>
[-- Attachment #1: Type: text/plain, Size: 2505 bytes --]
> On 3. Oct 2021, at 16:36, Sven Peter <sven@svenpeter.dev> wrote:
>
> Hi,
>
>
>> On Fri, Oct 1, 2021, at 06:47, Christian Zigotzky wrote:
>>> On 27 September 2021 at 07:39 am, Sven Peter wrote:
>>> Hi Christian,
>>>
>>> Thanks already for volunteering to test this!
>>>
>> Hello Sven,
>>
>> Damien (Hypex) has successfully tested the RC3 of kernel 5.15 with your
>> modified i2c driver on his Nemo board yesterday. [1]
>
> Thanks a lot, that's great to hear!
> If he wants to I can credit him with a Tested-by tag in the commit message,
> see e.g. https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes.
>
>
> Best,
>
>
> Sven
Hi Sven,
Unfortunately Damien has found an issue. [1]
Output of i2cdetect -l with the default RC3 of kernel 5.15 without your modifications:
2c-0 i2c Radeon i2c bit bus 0x90 I2C adapter
i2c-1 i2c Radeon i2c bit bus 0x91 I2C adapter
i2c-2 i2c Radeon i2c bit bus 0x92 I2C adapter
i2c-3 i2c Radeon i2c bit bus 0x93 I2C adapter
i2c-4 i2c Radeon i2c bit bus 0x94 I2C adapter
i2c-5 i2c Radeon i2c bit bus 0x95 I2C adapter
i2c-6 i2c Radeon i2c bit bus 0x96 I2C adapter
i2c-7 i2c Radeon i2c bit bus 0x97 I2C adapter
i2c-8 i2c PA Semi SMBus adapter at 0x800200 I2C adapter
i2c-9 i2c PA Semi SMBus adapter at 0x800240 I2C adapter
i2c-10 i2c PA Semi SMBus adapter at 0x800280 I2C adapter
Output of i2cdetect -l with your modifications:
i2c-0 i2c Radeon i2c bit bus 0x90 I2C adapter
i2c-1 i2c Radeon i2c bit bus 0x91 I2C adapter
i2c-2 i2c Radeon i2c bit bus 0x92 I2C adapter
i2c-3 i2c Radeon i2c bit bus 0x93 I2C adapter
i2c-4 i2c Radeon i2c bit bus 0x94 I2C adapter
i2c-5 i2c Radeon i2c bit bus 0x95 I2C adapter
i2c-6 i2c Radeon i2c bit bus 0x96 I2C adapter
i2c-7 i2c Radeon i2c bit bus 0x97 I2C adapter
i2c-8 i2c PA Semi SMBus adapter at 0x(____ptrval____) I2C adapter
i2c-9 i2c PA Semi SMBus adapter at 0x(____ptrval____) I2C adapter
i2c-10 i2c PA Semi SMBus adapter at 0x(____ptrval____) I2C adapter
Please check the outputs.
Thanks,
Christian
[1] https://forum.hyperion-entertainment.com/viewtopic.php?p=54165#p54165
[-- Attachment #2: Type: text/html, Size: 8612 bytes --]
^ permalink raw reply
* [PATCH v6 00/11] PCI: Drop duplicated tracking of a pci_dev's bound driver
From: Uwe Kleine-König @ 2021-10-04 12:59 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Giovanni Cabiddu, Mark Rutland, Sathya Prakash,
Alexander Shishkin, linux-pci, Alexander Duyck, x86, qat-linux,
oss-drivers, Oliver O'Halloran, H. Peter Anvin, Jiri Olsa,
Thomas Gleixner, Marco Chiappero, Stefano Stabellini, Herbert Xu,
linux-scsi, Rafał Miłecki, Jesse Brandeburg,
Peter Zijlstra, Ingo Molnar, linux-wireless, Jakub Kicinski,
Yisen Zhuang, Suganath Prabu Subramani, Fiona Trahe,
Andrew Donnellan, Arnd Bergmann, Konrad Rzeszutek Wilk,
Ido Schimmel, Simon Horman, linuxppc-dev,
Arnaldo Carvalho de Melo, Jack Xu, Borislav Petkov,
Michael Buesch, Jiri Pirko, Bjorn Helgaas, Namhyung Kim,
Boris Ostrovsky, Andy Shevchenko, Juergen Gross, Salil Mehta,
Sreekanth Reddy, xen-devel, Vadym Kochan, MPT-FusionLinux.pdl,
Greg Kroah-Hartman, linux-usb, Wojciech Ziemba, linux-kernel,
Mathias Nyman, Zhou Wang, linux-crypto, kernel, netdev,
Frederic Barrat, Paul Mackerras, Tomaszx Kowalik, Taras Chornyi,
David S. Miller, linux-perf-users
Hello,
this is v6 of the quest to drop the "driver" member from struct pci_dev
which tracks the same data (apart from a constant offset) as dev.driver.
Changes since v5:
- Some Acks added
- Some fixes in "PCI: Replace pci_dev::driver usage by
pci_dev::dev.driver" to properly handle that
to_pci_driver(X) is wrong if X is NULL.
This should fix the problem reported by Ido Schimmel.
Full range diff below.
This patch stack survived an allmodconfig build on arm64, m68k, powerpc,
riscv, s390, sparc64 and x86_64 on top of v5.15-rc3.
Best regards
Uwe
Uwe Kleine-König (11):
PCI: Simplify pci_device_remove()
PCI: Drop useless check from pci_device_probe()
xen/pci: Drop some checks that are always true
bcma: simplify reference to the driver's name
powerpc/eeh: Don't use driver member of struct pci_dev and further
cleanups
ssb: Simplify determination of driver name
PCI: Replace pci_dev::driver usage that gets the driver name
scsi: message: fusion: Remove unused parameter of mpt_pci driver's
probe()
crypto: qat - simplify adf_enable_aer()
PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
PCI: Drop duplicated tracking of a pci_dev's bound driver
arch/powerpc/include/asm/ppc-pci.h | 5 -
arch/powerpc/kernel/eeh.c | 8 ++
arch/powerpc/kernel/eeh_driver.c | 10 +-
arch/x86/events/intel/uncore.c | 2 +-
arch/x86/kernel/probe_roms.c | 10 +-
drivers/bcma/host_pci.c | 6 +-
drivers/crypto/hisilicon/qm.c | 2 +-
drivers/crypto/qat/qat_4xxx/adf_drv.c | 7 +-
drivers/crypto/qat/qat_c3xxx/adf_drv.c | 7 +-
drivers/crypto/qat/qat_c62x/adf_drv.c | 7 +-
drivers/crypto/qat/qat_common/adf_aer.c | 10 +-
.../crypto/qat/qat_common/adf_common_drv.h | 3 +-
drivers/crypto/qat/qat_dh895xcc/adf_drv.c | 7 +-
drivers/message/fusion/mptbase.c | 7 +-
drivers/message/fusion/mptbase.h | 2 +-
drivers/message/fusion/mptctl.c | 4 +-
drivers/message/fusion/mptlan.c | 2 +-
drivers/misc/cxl/guest.c | 24 +++--
drivers/misc/cxl/pci.c | 30 +++---
.../ethernet/hisilicon/hns3/hns3_ethtool.c | 2 +-
.../ethernet/marvell/prestera/prestera_pci.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
.../ethernet/netronome/nfp/nfp_net_ethtool.c | 3 +-
drivers/pci/iov.c | 33 +++++--
drivers/pci/pci-driver.c | 96 ++++++++++---------
drivers/pci/pci.c | 4 +-
drivers/pci/pcie/err.c | 36 +++----
drivers/pci/xen-pcifront.c | 63 ++++++------
drivers/ssb/pcihost_wrapper.c | 6 +-
drivers/usb/host/xhci-pci.c | 2 +-
include/linux/pci.h | 1 -
31 files changed, 208 insertions(+), 195 deletions(-)
Range-diff against v5:
-: ------------ > 1: c2b53ab26a6b PCI: Simplify pci_device_remove()
-: ------------ > 2: 2c733e1d5186 PCI: Drop useless check from pci_device_probe()
-: ------------ > 3: 547ca5a7aa16 xen/pci: Drop some checks that are always true
-: ------------ > 4: 40eb07353844 bcma: simplify reference to the driver's name
-: ------------ > 5: bab59c1dff6d powerpc/eeh: Don't use driver member of struct pci_dev and further cleanups
1: abd70de9782d ! 6: 92f4d61bbac3 ssb: Simplify determination of driver name
@@ Commit message
This has the upside of not requiring the driver member of struct pci_dev
which is about to be removed and being simpler.
+ Acked-by: Michael Büsch <m@bues.ch>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
## drivers/ssb/pcihost_wrapper.c ##
2: 735845bd26b9 ! 7: 6303f03ab2aa PCI: Replace pci_dev::driver usage that gets the driver name
@@ Commit message
driver name by dev_driver_string() which implicitly makes use of struct
pci_dev::dev->driver.
+ Acked-by: Simon Horman <simon.horman@corigine.com> (for NFP)
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
## drivers/crypto/hisilicon/qm.c ##
3: 1e58019165b9 = 8: 658a6c00ec96 scsi: message: fusion: Remove unused parameter of mpt_pci driver's probe()
4: dea72a470141 = 9: aceaf5321603 crypto: qat - simplify adf_enable_aer()
5: b4165dda38ea ! 10: 80648d999985 PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
@@ arch/x86/kernel/probe_roms.c: static struct resource video_rom_resource = {
static bool match_id(struct pci_dev *pdev, unsigned short vendor, unsigned short device)
{
- struct pci_driver *drv = pdev->driver;
-+ struct pci_driver *drv = to_pci_driver(pdev->dev.driver);
const struct pci_device_id *id;
if (pdev->vendor == vendor && pdev->device == device)
+ return true;
+
+- for (id = drv ? drv->id_table : NULL; id && id->vendor; id++)
+- if (id->vendor == vendor && id->device == device)
+- break;
++ if (pdev->dev.driver) {
++ struct pci_driver *drv = to_pci_driver(pdev->dev.driver);
++ for (id = drv->id_table; id && id->vendor; id++)
++ if (id->vendor == vendor && id->device == device)
++ break;
++ }
+
+ return id && id->vendor;
+ }
## drivers/misc/cxl/guest.c ##
@@ drivers/misc/cxl/guest.c: static void pci_error_handlers(struct cxl_afu *afu,
@@ drivers/pci/iov.c: static ssize_t sriov_vf_total_msix_show(struct device *dev,
device_lock(dev);
- if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix)
-+ pdrv = to_pci_driver(dev->driver);
-+ if (!pdrv || !pdrv->sriov_get_vf_total_msix)
++ if (!dev->driver)
goto unlock;
- vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
++ pdrv = to_pci_driver(dev->driver);
++ if (!pdrv->sriov_get_vf_total_msix)
++ goto unlock;
++
+ vf_total_msix = pdrv->sriov_get_vf_total_msix(pdev);
unlock:
device_unlock(dev);
@@ drivers/pci/iov.c: static ssize_t sriov_vf_msix_count_store(struct device *dev,
device_lock(&pdev->dev);
- if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
++ if (!pdev->dev.driver) {
++ ret = -EOPNOTSUPP;
++ goto err_pdev;
++ }
++
+ pdrv = to_pci_driver(pdev->dev.driver);
-+ if (!pdrv || !pdrv->sriov_set_msix_vec_count) {
++ if (!pdrv->sriov_set_msix_vec_count) {
ret = -EOPNOTSUPP;
goto err_pdev;
}
@@ drivers/pci/pci-driver.c: static void pci_device_remove(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
- struct pci_driver *drv = pci_dev->driver;
-+ struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
pm_runtime_resume(dev);
+- if (drv && drv->shutdown)
+- drv->shutdown(pci_dev);
++ if (pci_dev->dev.driver) {
++ struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
++
++ if (drv->shutdown)
++ drv->shutdown(pci_dev);
++ }
+
+ /*
+ * If this is a kexec reboot, turn off Bus Master bit on the
@@ drivers/pci/pci-driver.c: static int pci_pm_reenable_device(struct pci_dev *pci_dev)
static int pci_legacy_suspend(struct device *dev, pm_message_t state)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
- struct pci_driver *drv = pci_dev->driver;
-+ struct pci_driver *drv = to_pci_driver(dev->driver);
- if (drv && drv->suspend) {
- pci_power_t prev = pci_dev->current_state;
+- if (drv && drv->suspend) {
+- pci_power_t prev = pci_dev->current_state;
+- int error;
++ if (dev->driver) {
++ struct pci_driver *drv = to_pci_driver(dev->driver);
+
+- error = drv->suspend(pci_dev, state);
+- suspend_report_result(drv->suspend, error);
+- if (error)
+- return error;
++ if (drv->suspend) {
++ pci_power_t prev = pci_dev->current_state;
++ int error;
+
+- if (!pci_dev->state_saved && pci_dev->current_state != PCI_D0
+- && pci_dev->current_state != PCI_UNKNOWN) {
+- pci_WARN_ONCE(pci_dev, pci_dev->current_state != prev,
+- "PCI PM: Device state not saved by %pS\n",
+- drv->suspend);
++ error = drv->suspend(pci_dev, state);
++ suspend_report_result(drv->suspend, error);
++ if (error)
++ return error;
++
++ if (!pci_dev->state_saved && pci_dev->current_state != PCI_D0
++ && pci_dev->current_state != PCI_UNKNOWN) {
++ pci_WARN_ONCE(pci_dev, pci_dev->current_state != prev,
++ "PCI PM: Device state not saved by %pS\n",
++ drv->suspend);
++ }
+ }
+ }
+
@@ drivers/pci/pci-driver.c: static int pci_legacy_suspend_late(struct device *dev, pm_message_t state)
static int pci_legacy_resume(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
- struct pci_driver *drv = pci_dev->driver;
-+ struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
pci_fixup_device(pci_fixup_resume, pci_dev);
+- return drv && drv->resume ?
+- drv->resume(pci_dev) : pci_pm_reenable_device(pci_dev);
++ if (pci_dev->dev.driver) {
++ struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
++
++ if (drv->resume)
++ return drv->resume(pci_dev);
++ }
++
++ return pci_pm_reenable_device(pci_dev);
+ }
+
+ /* Auxiliary functions used by the new power management framework */
@@ drivers/pci/pci-driver.c: static void pci_pm_default_suspend(struct pci_dev *pci_dev)
static bool pci_has_legacy_pm_support(struct pci_dev *pci_dev)
{
- struct pci_driver *drv = pci_dev->driver;
-+ struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
- bool ret = drv && (drv->suspend || drv->resume);
+- bool ret = drv && (drv->suspend || drv->resume);
++ struct pci_driver *drv;
++ bool ret;
++
++ if (!pci_dev->dev.driver)
++ return false;
++
++ drv = to_pci_driver(pci_dev->dev.driver);
++ ret = drv && (drv->suspend || drv->resume);
/*
+ * Legacy PM support is used by default, so warn if the new framework is
@@ drivers/pci/pci-driver.c: static int pci_pm_runtime_suspend(struct device *dev)
int error;
6: d93a138bd7ab = 11: 2686d69bca17 PCI: Drop duplicated tracking of a pci_dev's bound driver
base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
--
2.30.2
^ permalink raw reply
* [PATCH 0/8] Fix long standing AER Error Handling Issues
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
This patch series aims at fixing some of the AER error handling issues
we have.
Currently we have the following issues:
- Confusing message in aer_print_error()
- aer_err_info not being initialized completely in DPC path before
we print the AER logs
- A bug [1] in clearing of AER registers in the native AER path
[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.
This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.
PATCH 1:
- Fixes the first issue
PATCH 2 - 4:
- Fixes the second issue
- "Patch 3/8" is dependent on "Patch 2/3" in the series
PATCH 5 - 7
- Deals with converging the various paths and to bring more
commonality between them
- "Patch 6/8" depends on "Patch 1/8"
PATCH 8:
- Adds extra information in AER error logs.
Thanks,
Naveen Naidu
Naveen Naidu (8):
[PATCH 1/8] PCI/AER: Remove ID from aer_agent_string[]
[PATCH 2/8] PCI: Cleanup struct aer_err_info
[PATCH 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
[PATCH 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
[PATCH 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
[PATCH 6/8] PCI/AER: Clear error device AER registers in aer_irq()
[PATCH 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
[PATCH 8/8] PCI/AER: Include DEVCTL in aer_print_error()
drivers/pci/pci.h | 23 +++-
drivers/pci/pcie/aer.c | 265 ++++++++++++++++++++++++++++-------------
drivers/pci/pcie/dpc.c | 9 +-
drivers/pci/pcie/err.c | 9 +-
4 files changed, 207 insertions(+), 99 deletions(-)
--
2.25.1
^ permalink raw reply
* [PATCH 1/8] PCI/AER: Remove ID from aer_agent_string[]
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
the AER error logs looked like:
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018 (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
The second line in the above logs prints "(Receiver ID)", even when
there is no "id" in the log line. This is confusing.
Remove the "ID" from the aer_agent_string[]. The error logs will
look as follows (Sample from dummy error injected by aer-inject):
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/aer.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
};
static const char *aer_agent_string[] = {
- "Receiver ID",
- "Requester ID",
- "Completer ID",
- "Transmitter ID"
+ "Receiver",
+ "Requester",
+ "Completer",
+ "Transmitter"
};
#define aer_stats_dev_attr(name, stats_array, strings_array, \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
const char *level;
if (!info->status) {
- pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
--
2.25.1
^ permalink raw reply related
* [PATCH 2/8] PCI: Cleanup struct aer_err_info
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.
The length of these registers are:
- id: 16 bits - Represents the Error Source Requester ID
- status: 32 bits - COR/UNCOR Error Status
- mask: 32 bits - COR/UNCOR Error Mask
Since the length of the above registers are even, use u16 and u32
to represent their values.
Also remove the __pad fields.
"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
- unsigned int id:16;
+ u16 id;
unsigned int severity:2; /* 0:NONFATAL | 1:FATAL | 2:COR */
- unsigned int __pad1:5;
unsigned int multi_error_valid:1;
unsigned int first_error:5;
- unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
- unsigned int status; /* COR/UNCOR Error Status */
- unsigned int mask; /* COR/UNCOR Error Mask */
+ u32 status; /* COR/UNCOR Error Status */
+ u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
};
--
2.25.1
^ permalink raw reply related
* [PATCH 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().
This helps clean up the code a bit.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/dpc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_nonfatal_status(pdev);
- pci_aer_clear_fatal_status(pdev);
+ pci_aer_clear_status(pdev);
}
}
--
2.25.1
^ permalink raw reply related
* [PATCH 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:
edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()
But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:
dpc_handler()
dpc_process_error()
pci_aer_clear_status()
pcie_do_recovery()
In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.
This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.
Bring the same semantics for clearing the AER status register in EDR
path and DPC path.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/dpc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_status(pdev);
}
}
@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;
dpc_process_error(pdev);
+ pci_aer_clear_status(pdev);
/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
--
2.25.1
^ permalink raw reply related
* [PATCH 6/8] PCI/AER: Clear error device AER registers in aer_irq()
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Converge the APEI path and native AER path of clearing the AER registers
of the error device.
In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:
aer_isr_one_error
aer_print_port_info
if (find_source_device())
aer_process_err_devices
handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)
The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.
Related Bug Report:
https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
https://bugzilla.kernel.org/show_bug.cgi?id=109691
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.
The main aim is that:
When an interrupt handler deals with a interrupt, it must *always*
clear the source of the interrupt.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 13 ++-
drivers/pci/pcie/aer.c | 245 ++++++++++++++++++++++++++++-------------
2 files changed, 182 insertions(+), 76 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev)
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
struct aer_err_info {
- struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
};
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+ struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+ struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+ struct pci_dev *dev;
+ struct aer_err_info err_info;
+};
+
int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
#endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..91f91d6ab052 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
#define AER_ERROR_SOURCES_MAX 128
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES == (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
#define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
#define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
@@ -46,7 +58,7 @@ struct aer_err_source {
struct aer_rpc {
struct pci_dev *rpd; /* Root Port device */
- DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+ DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
};
/* AER stats for the device */
@@ -806,11 +818,11 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
* @e_info: pointer to error info
* @dev: pointer to pci_dev to be added
*/
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev *dev)
{
- if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
- e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
- e_info->error_dev_num++;
+ if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+ e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+ e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
static int find_device_iter(struct pci_dev *dev, void *data)
{
- struct aer_err_info *e_info = (struct aer_err_info *)data;
+ struct aer_devices_err_info *e_dev = (struct aer_devices_err_info *)data;
- if (is_error_source(dev, e_info)) {
+ if (is_error_source(dev, &e_dev->err_info)) {
/* List this device */
- if (add_error_device(e_info, dev)) {
+ if (add_error_device(e_dev, dev)) {
/* We cannot handle more... Stop iteration */
/* TODO: Should print error message here? */
return 1;
}
/* If there is only a single error, stop iteration */
- if (!e_info->multi_error_valid)
+ if (!e_dev->err_info.multi_error_valid)
return 1;
}
return 0;
@@ -907,26 +919,26 @@ static int find_device_iter(struct pci_dev *dev, void *data)
* e_info->error_dev_num and e_info->dev[], based on the given information.
*/
static bool find_source_device(struct pci_dev *parent,
- struct aer_err_info *e_info)
+ struct aer_devices_err_info *e_dev)
{
struct pci_dev *dev = parent;
int result;
/* Must reset in this function */
- e_info->error_dev_num = 0;
+ e_dev->err_info.error_dev_num = 0;
/* Is Root Port an agent that sends error message? */
- result = find_device_iter(dev, e_info);
+ result = find_device_iter(dev, e_dev);
if (result)
return true;
if (pci_pcie_type(parent) == PCI_EXP_TYPE_RC_EC)
- pcie_walk_rcec(parent, find_device_iter, e_info);
+ pcie_walk_rcec(parent, find_device_iter, e_dev);
else
- pci_walk_bus(parent->subordinate, find_device_iter, e_info);
+ pci_walk_bus(parent->subordinate, find_device_iter, e_dev);
- if (!e_info->error_dev_num) {
- pci_info(parent, "can't find device of ID%04x\n", e_info->id);
+ if (!e_dev->err_info.error_dev_num) {
+ pci_info(parent, "can't find device of ID%04x\n", e_dev->err_info.id);
return false;
}
return true;
@@ -940,24 +952,42 @@ static bool find_source_device(struct pci_dev *parent,
* Invoked when an error being detected by Root Port.
*/
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
+{
+ /*
+ * Correctable error does not need software intervention.
+ * No need to go through error recovery process.
+ */
+ if (info->severity == AER_NONFATAL)
+ pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
+ else if (info->severity == AER_FATAL)
+ pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
+ pci_dev_put(dev);
+}
+
+/**
+ * clear_error_source_aer_registers - clear AER registers of the error source device
+ * @dev: pointer to pci_dev data structure of error source device
+ * @info: comprehensive error information
+ *
+ * Invoked when an error being detected by Root Port but before we handle the
+ * error.
+ */
+static void clear_error_source_aer_registers(struct pci_dev *dev, struct aer_err_info info)
{
int aer = dev->aer_cap;
- if (info->severity == AER_CORRECTABLE) {
- /*
- * Correctable error does not need software intervention.
- * No need to go through error recovery process.
- */
+ if (info.severity == AER_CORRECTABLE) {
if (aer)
pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
- info->status);
+ info.status);
if (pcie_aer_is_native(dev))
pcie_clear_device_status(dev);
- } else if (info->severity == AER_NONFATAL)
- pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
- else if (info->severity == AER_FATAL)
- pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
- pci_dev_put(dev);
+ } else if (info.severity == AER_NONFATAL) {
+ pci_aer_clear_nonfatal_status(dev);
+ } else if (info.severity == AER_FATAL) {
+ pci_aer_clear_fatal_status(dev);
+ }
+
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -1093,70 +1123,112 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
return 1;
}
-static inline void aer_process_err_devices(struct aer_err_info *e_info)
-{
- int i;
-
- /* Report all before handle them, not to lost records by reset etc. */
- for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
- aer_print_error(e_info->dev[i], e_info);
- }
- for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
- handle_error_source(e_info->dev[i], e_info);
- }
-}
-
/**
- * aer_isr_one_error - consume an error detected by root port
- * @rpc: pointer to the root port which holds an error
+ * aer_find_corr_error_source_device - find the error source which detected the corrected error
+ * @rp: pointer to Root Port pci_dev data structure
* @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
*/
-static void aer_isr_one_error(struct aer_rpc *rpc,
- struct aer_err_source *e_src)
+static bool aer_find_corr_error_source_device(struct pci_dev *rp,
+ struct aer_err_source *e_src,
+ struct aer_devices_err_info *e_info)
{
- struct pci_dev *pdev = rpc->rpd;
- struct aer_err_info e_info;
-
- pci_rootport_aer_stats_incr(pdev, e_src);
-
- /*
- * There is a possibility that both correctable error and
- * uncorrectable error being logged. Report correctable error first.
- */
if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
- e_info.id = ERR_COR_ID(e_src->id);
- e_info.severity = AER_CORRECTABLE;
+ e_info->err_info.id = ERR_COR_ID(e_src->id);
+ e_info->err_info.severity = AER_CORRECTABLE;
if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
- e_info.multi_error_valid = 1;
+ e_info->err_info.multi_error_valid = 1;
else
- e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
+ e_info->err_info.multi_error_valid = 0;
- if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+ if (!find_source_device(rp, e_info))
+ return false;
}
+ return true;
+}
+/**
+ * aer_find_corr_error_source_device - find the error source which detected the uncorrected error
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
+ */
+static bool aer_find_uncorr_error_source_device(struct pci_dev *rp,
+ struct aer_err_source *e_src,
+ struct aer_devices_err_info *e_info)
+{
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
- e_info.id = ERR_UNCOR_ID(e_src->id);
+ e_info->err_info.id = ERR_UNCOR_ID(e_src->id);
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
- e_info.severity = AER_FATAL;
+ e_info->err_info.severity = AER_FATAL;
else
- e_info.severity = AER_NONFATAL;
+ e_info->err_info.severity = AER_NONFATAL;
if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
- e_info.multi_error_valid = 1;
+ e_info->err_info.multi_error_valid = 1;
else
- e_info.multi_error_valid = 0;
+ e_info->err_info.multi_error_valid = 0;
+
+ if (!find_source_device(rp, e_info))
+ return false;
+ }
- aer_print_port_info(pdev, &e_info);
+ return true;
+}
- if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+/**
+ * aer_isr_one_error - consume an error detected by root port
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_dev: pointer to an error device
+ */
+static void aer_isr_one_error(struct pci_dev *rp, struct aer_dev_err_info *e_dev)
+{
+ aer_print_port_info(rp, &e_dev->err_info);
+ aer_print_error(e_dev->dev, &e_dev->err_info);
+ handle_error_source(e_dev->dev, &e_dev->err_info);
+}
+
+static bool aer_add_err_devices_to_queue(struct aer_rpc *rpc,
+ struct aer_devices_err_info *e_info)
+{
+ int i;
+ struct aer_dev_err_info *e_dev;
+
+ e_dev = kzalloc(sizeof(*e_dev), GFP_ATOMIC);
+ if (!e_dev)
+ return false;
+
+ for (i = 0; i < e_info->err_info.error_dev_num && e_info->dev[i]; i++) {
+ e_dev->err_info = e_info->err_info;
+ e_dev->dev = e_info->dev[i];
+
+ /*
+ * Store the AER register information for each error device on
+ * the queue
+ */
+ if (aer_get_device_error_info(e_dev->dev, &e_dev->err_info)) {
+ if (!kfifo_put(&rpc->aer_fifo, *e_dev))
+ return false;
+
+ clear_error_source_aer_registers(e_dev->dev, e_dev->err_info);
+ }
}
+
+ return true;
}
/**
@@ -1170,13 +1242,13 @@ static irqreturn_t aer_isr(int irq, void *context)
{
struct pcie_device *dev = (struct pcie_device *)context;
struct aer_rpc *rpc = get_service_data(dev);
- struct aer_err_source e_src;
+ struct aer_dev_err_info e_dev;
if (kfifo_is_empty(&rpc->aer_fifo))
return IRQ_NONE;
- while (kfifo_get(&rpc->aer_fifo, &e_src))
- aer_isr_one_error(rpc, &e_src);
+ while (kfifo_get(&rpc->aer_fifo, &e_dev))
+ aer_isr_one_error(rpc->rpd, &e_dev);
return IRQ_HANDLED;
}
@@ -1194,6 +1266,11 @@ static irqreturn_t aer_irq(int irq, void *context)
struct pci_dev *rp = rpc->rpd;
int aer = rp->aer_cap;
struct aer_err_source e_src = {};
+ struct aer_devices_err_info *e_info;
+
+ e_info = kzalloc(sizeof(*e_info), GFP_ATOMIC);
+ if (!e_info)
+ return IRQ_NONE;
pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
@@ -1202,8 +1279,26 @@ static irqreturn_t aer_irq(int irq, void *context)
pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, e_src.status);
- if (!kfifo_put(&rpc->aer_fifo, e_src))
- return IRQ_HANDLED;
+ pci_rootport_aer_stats_incr(rp, &e_src);
+
+ /*
+ * There is a possibility that both correctable error and
+ * uncorrectable error are being logged. Find the devices which caused
+ * correctable errors first so that they can be added to the queue first
+ * and will be reported first.
+ *
+ * Before adding the error device to the queue to be handled, clear the
+ * AER status registers.
+ */
+ if (aer_find_corr_error_source_device(rp, &e_src, e_info)) {
+ if (!aer_add_err_devices_to_queue(rpc, e_info))
+ return IRQ_NONE;
+ }
+
+ if (aer_find_uncorr_error_source_device(rp, &e_src, e_info)) {
+ if (!aer_add_err_devices_to_queue(rpc, e_info))
+ return IRQ_NONE;
+ }
return IRQ_WAKE_THREAD;
}
--
2.25.1
^ permalink raw reply related
* [PATCH 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
pcie_do_recovery() is shared across the following paths:
- ACPI APEI
- Native AER path
- EDR
- DPC
ACPI APEI
==========
ghes_handle_aer()
aer_recover_queue()
kfifo_in_spinlocked(aer_recover_ring)
aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
pcie_do_recovery()
In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()
Native AER
==========
aer_irq()
aer_add_err_devices_to_queue()
kfifo_put(&rpc->aer_fifo, *e_dev)
clear_error_source_aer_registers() <---- AER registers are cleard
aer_isr()
aer_isr_one_error()
handle_error_source()
pcie_do_recovery()
The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.
DPC
=====
dpc_handler()
dpc_process_error()
pci_aer_clear_status() <---- AER registers are cleared
pcie_do_recovery()
EDR
====
edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status() <---- AER registers are cleared
pcie_do_recovery()
In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.
Remove redundant clearing of AER register in pcie_do_recovery()
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/err.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
/*
* If we have native control of AER, clear error status in the device
- * that detected the error. If the platform retained control of AER,
- * it is responsible for clearing this status. In that case, the
- * signaling device may not even be visible to the OS.
+ * that detected the error.
*/
- if (host->native_aer || pcie_ports_native) {
+ if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
- pci_aer_clear_nonfatal_status(dev);
- }
+
pci_info(bridge, "device recovery successful\n");
return status;
--
2.25.1
^ permalink raw reply related
* [PATCH 8/8] PCI/AER: Include DEVCTL in aer_print_error()
From: Naveen Naidu @ 2021-10-04 13:52 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: Naveen Naidu, linuxppc-dev, linux-kernel-mentees, linux-kernel,
skhan
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.
Sample output from dummy error injected by aer-inject:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aer.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+ u16 devctl;
};
/* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
- pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device, info->status, info->mask);
+ pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+ dev->vendor, dev->device, info->status, info->mask, info->devctl);
__aer_print_error(dev, info);
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
if (!aer)
return 0;
+ /*
+ * Cache the value of Device Control Register now, because later the
+ * device might not be available
+ */
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
&info->status);
--
2.25.1
^ permalink raw reply related
* [PATCH v2 0/8] Fix long standing AER Error Handling Issues
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
This patch series aims at fixing some of the AER error handling issues
we have.
Currently we have the following issues:
- Confusing message in aer_print_error()
- aer_err_info not being initialized completely in DPC path before
we print the AER logs
- A bug [1] in clearing of AER registers in the native AER path
[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.
This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.
PATCH 1:
- Fixes the first issue
PATCH 2 - 4:
- Fixes the second issue
- "Patch 3/8" is dependent on "Patch 2/3" in the series
PATCH 5 - 7
- Deals with converging the various paths and to bring more
commonality between them
- "Patch 6/8" depends on "Patch 1/8"
PATCH 8:
- Adds extra information in AER error logs.
Thanks,
Naveen Naidu
Changelog
=========
v2:
Apologies for the mistake, I forgot to cc the linux-pci mailing list.
Resent the email with cc to linux-pci.
Naveen Naidu (8):
[PATCH v2 1/8] PCI/AER: Remove ID from aer_agent_string[]
[PATCH v2 2/8] PCI: Cleanup struct aer_err_info
[PATCH v2 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
[PATCH v2 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
[PATCH v2 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
[PATCH v2 6/8] PCI/AER: Clear error device AER registers in aer_irq()
[PATCH v2 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
[PATCH v2 8/8] PCI/AER: Include DEVCTL in aer_print_error()
drivers/pci/pci.h | 23 +++-
drivers/pci/pcie/aer.c | 265 ++++++++++++++++++++++++++++-------------
drivers/pci/pcie/dpc.c | 9 +-
drivers/pci/pcie/err.c | 9 +-
4 files changed, 207 insertions(+), 99 deletions(-)
--
2.25.1
^ permalink raw reply
* [PATCH v2 1/8] PCI/AER: Remove ID from aer_agent_string[]
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
the AER error logs looked like:
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018 (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
The second line in the above logs prints "(Receiver ID)", even when
there is no "id" in the log line. This is confusing.
Remove the "ID" from the aer_agent_string[]. The error logs will
look as follows (Sample from dummy error injected by aer-inject):
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/aer.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
};
static const char *aer_agent_string[] = {
- "Receiver ID",
- "Requester ID",
- "Completer ID",
- "Transmitter ID"
+ "Receiver",
+ "Requester",
+ "Completer",
+ "Transmitter"
};
#define aer_stats_dev_attr(name, stats_array, strings_array, \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
const char *level;
if (!info->status) {
- pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
--
2.25.1
^ permalink raw reply related
* [PATCH v2 2/8] PCI: Cleanup struct aer_err_info
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.
The length of these registers are:
- id: 16 bits - Represents the Error Source Requester ID
- status: 32 bits - COR/UNCOR Error Status
- mask: 32 bits - COR/UNCOR Error Mask
Since the length of the above registers are even, use u16 and u32
to represent their values.
Also remove the __pad fields.
"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
- unsigned int id:16;
+ u16 id;
unsigned int severity:2; /* 0:NONFATAL | 1:FATAL | 2:COR */
- unsigned int __pad1:5;
unsigned int multi_error_valid:1;
unsigned int first_error:5;
- unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
- unsigned int status; /* COR/UNCOR Error Status */
- unsigned int mask; /* COR/UNCOR Error Mask */
+ u32 status; /* COR/UNCOR Error Status */
+ u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
};
--
2.25.1
^ permalink raw reply related
* [PATCH v2 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().
This helps clean up the code a bit.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/dpc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_nonfatal_status(pdev);
- pci_aer_clear_fatal_status(pdev);
+ pci_aer_clear_status(pdev);
}
}
--
2.25.1
^ permalink raw reply related
* [PATCH v2 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:
edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()
But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:
dpc_handler()
dpc_process_error()
pci_aer_clear_status()
pcie_do_recovery()
In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.
This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.
Bring the same semantics for clearing the AER status register in EDR
path and DPC path.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/dpc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_status(pdev);
}
}
@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;
dpc_process_error(pdev);
+ pci_aer_clear_status(pdev);
/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
--
2.25.1
^ permalink raw reply related
* [PATCH v2 6/8] PCI/AER: Clear error device AER registers in aer_irq()
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Converge the APEI path and native AER path of clearing the AER registers
of the error device.
In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:
aer_isr_one_error
aer_print_port_info
if (find_source_device())
aer_process_err_devices
handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)
The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.
Related Bug Report:
https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
https://bugzilla.kernel.org/show_bug.cgi?id=109691
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.
The main aim is that:
When an interrupt handler deals with a interrupt, it must *always*
clear the source of the interrupt.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 13 ++-
drivers/pci/pcie/aer.c | 245 ++++++++++++++++++++++++++++-------------
2 files changed, 182 insertions(+), 76 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev)
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
struct aer_err_info {
- struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
};
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+ struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+ struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+ struct pci_dev *dev;
+ struct aer_err_info err_info;
+};
+
int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
#endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..91f91d6ab052 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
#define AER_ERROR_SOURCES_MAX 128
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES == (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
#define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */
#define AER_MAX_TYPEOF_UNCOR_ERRS 27 /* as per PCI_ERR_UNCOR_STATUS*/
@@ -46,7 +58,7 @@ struct aer_err_source {
struct aer_rpc {
struct pci_dev *rpd; /* Root Port device */
- DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+ DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
};
/* AER stats for the device */
@@ -806,11 +818,11 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
* @e_info: pointer to error info
* @dev: pointer to pci_dev to be added
*/
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev *dev)
{
- if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
- e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
- e_info->error_dev_num++;
+ if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+ e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+ e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
static int find_device_iter(struct pci_dev *dev, void *data)
{
- struct aer_err_info *e_info = (struct aer_err_info *)data;
+ struct aer_devices_err_info *e_dev = (struct aer_devices_err_info *)data;
- if (is_error_source(dev, e_info)) {
+ if (is_error_source(dev, &e_dev->err_info)) {
/* List this device */
- if (add_error_device(e_info, dev)) {
+ if (add_error_device(e_dev, dev)) {
/* We cannot handle more... Stop iteration */
/* TODO: Should print error message here? */
return 1;
}
/* If there is only a single error, stop iteration */
- if (!e_info->multi_error_valid)
+ if (!e_dev->err_info.multi_error_valid)
return 1;
}
return 0;
@@ -907,26 +919,26 @@ static int find_device_iter(struct pci_dev *dev, void *data)
* e_info->error_dev_num and e_info->dev[], based on the given information.
*/
static bool find_source_device(struct pci_dev *parent,
- struct aer_err_info *e_info)
+ struct aer_devices_err_info *e_dev)
{
struct pci_dev *dev = parent;
int result;
/* Must reset in this function */
- e_info->error_dev_num = 0;
+ e_dev->err_info.error_dev_num = 0;
/* Is Root Port an agent that sends error message? */
- result = find_device_iter(dev, e_info);
+ result = find_device_iter(dev, e_dev);
if (result)
return true;
if (pci_pcie_type(parent) == PCI_EXP_TYPE_RC_EC)
- pcie_walk_rcec(parent, find_device_iter, e_info);
+ pcie_walk_rcec(parent, find_device_iter, e_dev);
else
- pci_walk_bus(parent->subordinate, find_device_iter, e_info);
+ pci_walk_bus(parent->subordinate, find_device_iter, e_dev);
- if (!e_info->error_dev_num) {
- pci_info(parent, "can't find device of ID%04x\n", e_info->id);
+ if (!e_dev->err_info.error_dev_num) {
+ pci_info(parent, "can't find device of ID%04x\n", e_dev->err_info.id);
return false;
}
return true;
@@ -940,24 +952,42 @@ static bool find_source_device(struct pci_dev *parent,
* Invoked when an error being detected by Root Port.
*/
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
+{
+ /*
+ * Correctable error does not need software intervention.
+ * No need to go through error recovery process.
+ */
+ if (info->severity == AER_NONFATAL)
+ pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
+ else if (info->severity == AER_FATAL)
+ pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
+ pci_dev_put(dev);
+}
+
+/**
+ * clear_error_source_aer_registers - clear AER registers of the error source device
+ * @dev: pointer to pci_dev data structure of error source device
+ * @info: comprehensive error information
+ *
+ * Invoked when an error being detected by Root Port but before we handle the
+ * error.
+ */
+static void clear_error_source_aer_registers(struct pci_dev *dev, struct aer_err_info info)
{
int aer = dev->aer_cap;
- if (info->severity == AER_CORRECTABLE) {
- /*
- * Correctable error does not need software intervention.
- * No need to go through error recovery process.
- */
+ if (info.severity == AER_CORRECTABLE) {
if (aer)
pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
- info->status);
+ info.status);
if (pcie_aer_is_native(dev))
pcie_clear_device_status(dev);
- } else if (info->severity == AER_NONFATAL)
- pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
- else if (info->severity == AER_FATAL)
- pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
- pci_dev_put(dev);
+ } else if (info.severity == AER_NONFATAL) {
+ pci_aer_clear_nonfatal_status(dev);
+ } else if (info.severity == AER_FATAL) {
+ pci_aer_clear_fatal_status(dev);
+ }
+
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -1093,70 +1123,112 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
return 1;
}
-static inline void aer_process_err_devices(struct aer_err_info *e_info)
-{
- int i;
-
- /* Report all before handle them, not to lost records by reset etc. */
- for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
- aer_print_error(e_info->dev[i], e_info);
- }
- for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
- handle_error_source(e_info->dev[i], e_info);
- }
-}
-
/**
- * aer_isr_one_error - consume an error detected by root port
- * @rpc: pointer to the root port which holds an error
+ * aer_find_corr_error_source_device - find the error source which detected the corrected error
+ * @rp: pointer to Root Port pci_dev data structure
* @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
*/
-static void aer_isr_one_error(struct aer_rpc *rpc,
- struct aer_err_source *e_src)
+static bool aer_find_corr_error_source_device(struct pci_dev *rp,
+ struct aer_err_source *e_src,
+ struct aer_devices_err_info *e_info)
{
- struct pci_dev *pdev = rpc->rpd;
- struct aer_err_info e_info;
-
- pci_rootport_aer_stats_incr(pdev, e_src);
-
- /*
- * There is a possibility that both correctable error and
- * uncorrectable error being logged. Report correctable error first.
- */
if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
- e_info.id = ERR_COR_ID(e_src->id);
- e_info.severity = AER_CORRECTABLE;
+ e_info->err_info.id = ERR_COR_ID(e_src->id);
+ e_info->err_info.severity = AER_CORRECTABLE;
if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
- e_info.multi_error_valid = 1;
+ e_info->err_info.multi_error_valid = 1;
else
- e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
+ e_info->err_info.multi_error_valid = 0;
- if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+ if (!find_source_device(rp, e_info))
+ return false;
}
+ return true;
+}
+/**
+ * aer_find_corr_error_source_device - find the error source which detected the uncorrected error
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
+ */
+static bool aer_find_uncorr_error_source_device(struct pci_dev *rp,
+ struct aer_err_source *e_src,
+ struct aer_devices_err_info *e_info)
+{
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
- e_info.id = ERR_UNCOR_ID(e_src->id);
+ e_info->err_info.id = ERR_UNCOR_ID(e_src->id);
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
- e_info.severity = AER_FATAL;
+ e_info->err_info.severity = AER_FATAL;
else
- e_info.severity = AER_NONFATAL;
+ e_info->err_info.severity = AER_NONFATAL;
if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
- e_info.multi_error_valid = 1;
+ e_info->err_info.multi_error_valid = 1;
else
- e_info.multi_error_valid = 0;
+ e_info->err_info.multi_error_valid = 0;
+
+ if (!find_source_device(rp, e_info))
+ return false;
+ }
- aer_print_port_info(pdev, &e_info);
+ return true;
+}
- if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+/**
+ * aer_isr_one_error - consume an error detected by root port
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_dev: pointer to an error device
+ */
+static void aer_isr_one_error(struct pci_dev *rp, struct aer_dev_err_info *e_dev)
+{
+ aer_print_port_info(rp, &e_dev->err_info);
+ aer_print_error(e_dev->dev, &e_dev->err_info);
+ handle_error_source(e_dev->dev, &e_dev->err_info);
+}
+
+static bool aer_add_err_devices_to_queue(struct aer_rpc *rpc,
+ struct aer_devices_err_info *e_info)
+{
+ int i;
+ struct aer_dev_err_info *e_dev;
+
+ e_dev = kzalloc(sizeof(*e_dev), GFP_ATOMIC);
+ if (!e_dev)
+ return false;
+
+ for (i = 0; i < e_info->err_info.error_dev_num && e_info->dev[i]; i++) {
+ e_dev->err_info = e_info->err_info;
+ e_dev->dev = e_info->dev[i];
+
+ /*
+ * Store the AER register information for each error device on
+ * the queue
+ */
+ if (aer_get_device_error_info(e_dev->dev, &e_dev->err_info)) {
+ if (!kfifo_put(&rpc->aer_fifo, *e_dev))
+ return false;
+
+ clear_error_source_aer_registers(e_dev->dev, e_dev->err_info);
+ }
}
+
+ return true;
}
/**
@@ -1170,13 +1242,13 @@ static irqreturn_t aer_isr(int irq, void *context)
{
struct pcie_device *dev = (struct pcie_device *)context;
struct aer_rpc *rpc = get_service_data(dev);
- struct aer_err_source e_src;
+ struct aer_dev_err_info e_dev;
if (kfifo_is_empty(&rpc->aer_fifo))
return IRQ_NONE;
- while (kfifo_get(&rpc->aer_fifo, &e_src))
- aer_isr_one_error(rpc, &e_src);
+ while (kfifo_get(&rpc->aer_fifo, &e_dev))
+ aer_isr_one_error(rpc->rpd, &e_dev);
return IRQ_HANDLED;
}
@@ -1194,6 +1266,11 @@ static irqreturn_t aer_irq(int irq, void *context)
struct pci_dev *rp = rpc->rpd;
int aer = rp->aer_cap;
struct aer_err_source e_src = {};
+ struct aer_devices_err_info *e_info;
+
+ e_info = kzalloc(sizeof(*e_info), GFP_ATOMIC);
+ if (!e_info)
+ return IRQ_NONE;
pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
@@ -1202,8 +1279,26 @@ static irqreturn_t aer_irq(int irq, void *context)
pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, e_src.status);
- if (!kfifo_put(&rpc->aer_fifo, e_src))
- return IRQ_HANDLED;
+ pci_rootport_aer_stats_incr(rp, &e_src);
+
+ /*
+ * There is a possibility that both correctable error and
+ * uncorrectable error are being logged. Find the devices which caused
+ * correctable errors first so that they can be added to the queue first
+ * and will be reported first.
+ *
+ * Before adding the error device to the queue to be handled, clear the
+ * AER status registers.
+ */
+ if (aer_find_corr_error_source_device(rp, &e_src, e_info)) {
+ if (!aer_add_err_devices_to_queue(rpc, e_info))
+ return IRQ_NONE;
+ }
+
+ if (aer_find_uncorr_error_source_device(rp, &e_src, e_info)) {
+ if (!aer_add_err_devices_to_queue(rpc, e_info))
+ return IRQ_NONE;
+ }
return IRQ_WAKE_THREAD;
}
--
2.25.1
^ permalink raw reply related
* [PATCH v2 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
pcie_do_recovery() is shared across the following paths:
- ACPI APEI
- Native AER path
- EDR
- DPC
ACPI APEI
==========
ghes_handle_aer()
aer_recover_queue()
kfifo_in_spinlocked(aer_recover_ring)
aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
pcie_do_recovery()
In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()
Native AER
==========
aer_irq()
aer_add_err_devices_to_queue()
kfifo_put(&rpc->aer_fifo, *e_dev)
clear_error_source_aer_registers() <---- AER registers are cleard
aer_isr()
aer_isr_one_error()
handle_error_source()
pcie_do_recovery()
The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.
DPC
=====
dpc_handler()
dpc_process_error()
pci_aer_clear_status() <---- AER registers are cleared
pcie_do_recovery()
EDR
====
edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status() <---- AER registers are cleared
pcie_do_recovery()
In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.
Remove redundant clearing of AER register in pcie_do_recovery()
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/err.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
/*
* If we have native control of AER, clear error status in the device
- * that detected the error. If the platform retained control of AER,
- * it is responsible for clearing this status. In that case, the
- * signaling device may not even be visible to the OS.
+ * that detected the error.
*/
- if (host->native_aer || pcie_ports_native) {
+ if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
- pci_aer_clear_nonfatal_status(dev);
- }
+
pci_info(bridge, "device recovery successful\n");
return status;
--
2.25.1
^ permalink raw reply related
* [PATCH 8/8] PCI/AER: Include DEVCTL in aer_print_error()
From: Naveen Naidu @ 2021-10-04 14:06 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.
Sample output from dummy error injected by aer-inject:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aer.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+ u16 devctl;
};
/* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
- pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device, info->status, info->mask);
+ pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+ dev->vendor, dev->device, info->status, info->mask, info->devctl);
__aer_print_error(dev, info);
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
if (!aer)
return 0;
+ /*
+ * Cache the value of Device Control Register now, because later the
+ * device might not be available
+ */
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
&info->status);
--
2.25.1
^ permalink raw reply related
* [PATCH v2 8/8] PCI/AER: Include DEVCTL in aer_print_error()
From: Naveen Naidu @ 2021-10-04 14:12 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633353468.git.naveennaidu479@gmail.com>
Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.
Sample output from dummy error injected by aer-inject:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aer.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+ u16 devctl;
};
/* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
- pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device, info->status, info->mask);
+ pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+ dev->vendor, dev->device, info->status, info->mask, info->devctl);
__aer_print_error(dev, info);
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
if (!aer)
return 0;
+ /*
+ * Cache the value of Device Control Register now, because later the
+ * device might not be available
+ */
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
&info->status);
--
2.25.1
^ permalink raw reply related
* [PATCH v3 0/8] Fix long standing AER Error Handling Issues
From: Naveen Naidu @ 2021-10-04 14:29 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
This patch series aims at fixing some of the AER error handling issues
we have.
Currently we have the following issues:
- Confusing message in aer_print_error()
- aer_err_info not being initialized completely in DPC path before
we print the AER logs
- A bug [1] in clearing of AER registers in the native AER path
[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.
This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.
PATCH 1:
- Fixes the first issue
PATCH 2 - 4:
- Fixes the second issue
- "Patch 3/8" is dependent on "Patch 2/3" in the series
PATCH 5 - 7
- Deals with converging the various paths and to bring more
commonality between them
- "Patch 6/8" depends on "Patch 1/8"
PATCH 8:
- Adds extra information in AER error logs.
Thanks,
Naveen Naidu
Changelog
=========
v3:
- Fix up mail formatting and resend the patches again.
Really sorry for all the spam. I messed up in the first try and
instead of fixing it well in v2, I messed up again. I have fixed
everything now. Apologies for the inconvenience caused. I'll make
sure to not repeat it again.
v2:
- Apologies for the mistake, I forgot to cc the linux-pci mailing
list.Resent the email with cc to linux-pci
Naveen Naidu (8):
[PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]
[PATCH v3 2/8] PCI: Cleanup struct aer_err_info
[PATCH v3 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
[PATCH v3 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
[PATCH v3 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
[PATCH v3 6/8] PCI/AER: Clear error device AER registers in aer_irq()
[PATCH v3 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
[PATCH v3 8/8] PCI/AER: Include DEVCTL in aer_print_error()
drivers/pci/pci.h | 23 +++-
drivers/pci/pcie/aer.c | 265 ++++++++++++++++++++++++++++-------------
drivers/pci/pcie/dpc.c | 9 +-
drivers/pci/pcie/err.c | 9 +-
4 files changed, 207 insertions(+), 99 deletions(-)
--
2.25.1
^ permalink raw reply
* [PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]
From: Naveen Naidu @ 2021-10-04 14:29 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633357368.git.naveennaidu479@gmail.com>
Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
the AER error logs looked like:
pcieport 0000:00:03.0: AER: Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018 (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
The second line in the above logs prints "(Receiver ID)", even when
there is no "id" in the log line. This is confusing.
Remove the "ID" from the aer_agent_string[]. The error logs will
look as follows (Sample from dummy error injected by aer-inject):
pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000
pcieport 0000:00:03.0: [ 6] BadTLP
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pcie/aer.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
};
static const char *aer_agent_string[] = {
- "Receiver ID",
- "Requester ID",
- "Completer ID",
- "Transmitter ID"
+ "Receiver",
+ "Requester",
+ "Completer",
+ "Transmitter"
};
#define aer_stats_dev_attr(name, stats_array, strings_array, \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
const char *level;
if (!info->status) {
- pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
--
2.25.1
^ permalink raw reply related
* [PATCH v3 2/8] PCI: Cleanup struct aer_err_info
From: Naveen Naidu @ 2021-10-04 14:29 UTC (permalink / raw)
To: bhelgaas, ruscur, oohall
Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
linux-kernel-mentees
In-Reply-To: <cover.1633357368.git.naveennaidu479@gmail.com>
The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.
The length of these registers are:
- id: 16 bits - Represents the Error Source Requester ID
- status: 32 bits - COR/UNCOR Error Status
- mask: 32 bits - COR/UNCOR Error Mask
Since the length of the above registers are even, use u16 and u32
to represent their values.
Also remove the __pad fields.
"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
drivers/pci/pci.h | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
- unsigned int id:16;
+ u16 id;
unsigned int severity:2; /* 0:NONFATAL | 1:FATAL | 2:COR */
- unsigned int __pad1:5;
unsigned int multi_error_valid:1;
unsigned int first_error:5;
- unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
- unsigned int status; /* COR/UNCOR Error Status */
- unsigned int mask; /* COR/UNCOR Error Mask */
+ u32 status; /* COR/UNCOR Error Status */
+ u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
};
--
2.25.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox