* X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
@ 2015-07-24 22:42 Bjorn Helgaas
  2015-07-25 0:05 ` Duc Dang
  2015-07-28 14:37 ` Dall, Elizabeth J
  0 siblings, 2 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2015-07-24 22:42 UTC
  To: Tanmay Inamdar; +Cc: Duc Dang, linux-pci, linux-arm-kernel, linux-kernel

I regularly see faults like this on an APM X-Gene:

  U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
  CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
    32 KB ICACHE, 32 KB DCACHE
    SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
  ...
  Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
  Internal error: : 96000010 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
  Hardware name: APM X-Gene Mustang board (DT)
  task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
  PC is at pci_generic_config_read32+0x4c/0xb8
  LR is at pci_generic_config_read32+0x40/0xb8
  pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
  ...
  Call trace:
  [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
  [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
  [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
  [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
  [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
  [<ffffffc0001c361c>] __vfs_read+0x44/0x128
  [<ffffffc0001c3e28>] vfs_read+0x84/0x144
  [<ffffffc0001c4764>] SyS_read+0x50/0xb0

  # lspci
  00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
  01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family

I first saw this on an ancient kernel and thought it was likely specific to
my environment, but I'm now using an almost unmodified v4.1 kernel and
still seeing it. Does anybody else see this? The box does have a PCI card
installed, but I haven't yet worked out what device's config space we're
trying to read.

Is there anything I can do to debug this? I'm not an arm64 guy, but my
impression is that this is a page fault, and the address seems to be in the
"cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
really a PCI issue -- maybe that page mapping got trashed by somebody else?

Bjorn
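[Editor's sketch, not part of the original mail] The syndrome value 0x96000010 in the oops can be split into its architectural fields to see what kind of fault was reported. The helper below is a minimal illustration, assuming the standard AArch64 ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and the data fault status code in ISS bits [5:0]). The function name decode_esr and the small lookup tables are hypothetical and cover only a handful of encodings.

```python
# Minimal decoder for an AArch64 ESR_ELx value (illustrative, not exhaustive).
# decode_esr and the tables below are hypothetical helpers for this thread.

EC_SHIFT = 26          # exception class lives in bits [31:26]
EC_MASK = 0x3F
IL_SHIFT = 25          # instruction length bit
ISS_MASK = (1 << 25) - 1   # instruction specific syndrome, bits [24:0]
DFSC_MASK = 0x3F       # data fault status code, ISS bits [5:0]

EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
    0x10: "synchronous external abort, not on a translation table walk",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into EC, IL, ISS and DFSC fields."""
    ec = (esr >> EC_SHIFT) & EC_MASK
    il = (esr >> IL_SHIFT) & 1
    iss = esr & ISS_MASK
    dfsc = iss & DFSC_MASK
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this sketch's table"),
        "il": il,
        "iss": iss,
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "not in this sketch's table"),
    }


if __name__ == "__main__":
    fields = decode_esr(0x96000010)   # value taken from the oops above
    for key, value in fields.items():
        print(key, "=", hex(value) if isinstance(value, int) else value)
```

For the value in the log, the exception class is 0x25 (a data abort taken in the kernel) and the fault status code is 0x10, which is the external-abort encoding rather than one of the translation-fault encodings a missing page mapping would produce.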
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-24 22:42 X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 Bjorn Helgaas
@ 2015-07-25 0:05 ` Duc Dang
  2015-07-27 11:36 ` Catalin Marinas
  2015-07-28 16:43 ` Bjorn Helgaas
  2015-07-28 14:37 ` Dall, Elizabeth J
  1 sibling, 2 replies; 24+ messages in thread
From: Duc Dang @ 2015-07-25 0:05 UTC
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

Hi Bjorn,

On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>
> I regularly see faults like this on an APM X-Gene:
>
> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> 32 KB ICACHE, 32 KB DCACHE
> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> ...
> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> Internal error: : 96000010 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> Hardware name: APM X-Gene Mustang board (DT)
> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> PC is at pci_generic_config_read32+0x4c/0xb8
> LR is at pci_generic_config_read32+0x40/0xb8
> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> ...
> Call trace:
> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> [<ffffffc0001c4764>] SyS_read+0x50/0xb0

The log shows kernel gets an exception when trying to access Mellanox
card configuration space. This is usually due to suboptimal PCIe
SerDes parameters being used on your board, which will cause bad link
quality.
The PCIe SerDes programming is done in U-Boot, so I suggest you do a
U-Boot upgrade to our latest X-Gene U-Boot release.

In order to access latest X-Gene U-Boot release, please use APM
official support channel:
https://myapm.apm.com

Please register an account at myapm.apm.com if you don't have one
using following link:
https://myapm.apm.com/user/register

>
> # lspci
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>
> I first saw this on an ancient kernel and thought it was likely specific to
> my environment, but I'm now using an almost unmodified v4.1 kernel and
> still seeing it. Does anybody else see this? The box does have a PCI card
> installed, but I haven't yet worked out what device's config space we're
> trying to read.
>
> Is there anything I can do to debug this? I'm not an arm64 guy, but my
> impression is that this is a page fault, and the address seems to be in the
> "cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
> really a PCI issue -- maybe that page mapping got trashed by somebody else?
>
> Bjorn

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-25 0:05 ` Duc Dang
@ 2015-07-27 11:36 ` Catalin Marinas
  2015-07-28 17:39 ` Duc Dang
  2015-07-28 16:43 ` Bjorn Helgaas
  1 sibling, 1 reply; 24+ messages in thread
From: Catalin Marinas @ 2015-07-27 11:36 UTC
  To: Duc Dang
  Cc: Bjorn Helgaas, linux-pci, Tanmay Inamdar, linux-arm, linux-kernel

On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > I regularly see faults like this on an APM X-Gene:
> >
> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> > 32 KB ICACHE, 32 KB DCACHE
> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> > ...
> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034

That's generated by an external device (PCIe root complex, card etc.)
and some mis-configured CPU setting.

> > Internal error: : 96000010 [#1] SMP
> > Modules linked in:
> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> > Hardware name: APM X-Gene Mustang board (DT)
> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> > PC is at pci_generic_config_read32+0x4c/0xb8
> > LR is at pci_generic_config_read32+0x40/0xb8
> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> > ...
> > Call trace:
> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>
> The log shows kernel gets an exception when trying to access Mellanox
> card configuration space. This is usually due to suboptimal PCIe
> SerDes parameters being used on your board, which will cause bad link
> quality.

I would have hoped that "suboptimal" means that it still works, albeit
not fully optimal ;).

> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> U-Boot upgrade to our latest X-Gene U-Boot release.
>
> In order to access latest X-Gene U-Boot release, please use APM
> official support channel:
> https://myapm.apm.com
>
> Please register an account at myapm.apm.com if you don't have one
> using following link:
> https://myapm.apm.com/user/register

Isn't the latest U-Boot source for X-Gene publicly available anywhere?
It's GPL code anyway, so it shouldn't have proprietary code to require
registration, click-through agreements.

--
Catalin
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-27 11:36 ` Catalin Marinas
@ 2015-07-28 17:39 ` Duc Dang
  2015-07-28 18:36 ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Duc Dang @ 2015-07-28 17:39 UTC
  To: Catalin Marinas
  Cc: Bjorn Helgaas, linux-pci, Tanmay Inamdar, linux-arm, Linux Kernel Mailing List

On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> > I regularly see faults like this on an APM X-Gene:
>> >
>> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> > 32 KB ICACHE, 32 KB DCACHE
>> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> > ...
>> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>
> That's generated by an external device (PCIe root complex, card etc.)
> and some mis-configured CPU setting.
>
>> > Internal error: : 96000010 [#1] SMP
>> > Modules linked in:
>> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> > Hardware name: APM X-Gene Mustang board (DT)
>> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> > PC is at pci_generic_config_read32+0x4c/0xb8
>> > LR is at pci_generic_config_read32+0x40/0xb8
>> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> > ...
>> > Call trace:
>> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows kernel gets an exception when trying to access Mellanox
>> card configuration space. This is usually due to suboptimal PCIe
>> SerDes parameters being used on your board, which will cause bad link
>> quality.
>
> I would have hoped that "suboptimal" means that it still works, albeit
> not fully optimal ;).

Yes, it should still work, but you may see crashes occasionally due to
link quality.

>
>> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> U-Boot upgrade to our latest X-Gene U-Boot release.
>>
>> In order to access latest X-Gene U-Boot release, please use APM
>> official support channel:
>> https://myapm.apm.com
>>
>> Please register an account at myapm.apm.com if you don't have one
>> using following link:
>> https://myapm.apm.com/user/register
>
> Isn't the latest U-Boot source for X-Gene publicly available anywhere?
> It's GPL code anyway, so it shouldn't have proprietary code to require
> registration, click-through agreements.

APM X-Gene U-Boot isn't available publicly yet. If this is required,
though, we can set up a public Git repository hosted on an APM server.
As of now, customers who have a board from APM have to use MyAPM to
get the U-Boot source and binary.

>
> --
> Catalin

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 17:39 ` Duc Dang
@ 2015-07-28 18:36 ` Bjorn Helgaas
  0 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 18:36 UTC
  To: Duc Dang
  Cc: Catalin Marinas, linux-pci@vger.kernel.org, Tanmay Inamdar, linux-arm, Linux Kernel Mailing List

On Tue, Jul 28, 2015 at 12:39 PM, Duc Dang <dhdang@apm.com> wrote:
> On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>> > I regularly see faults like this on an APM X-Gene:
>>> >
>>> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>> > 32 KB ICACHE, 32 KB DCACHE
>>> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>> > ...
>>> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>
>> That's generated by an external device (PCIe root complex, card etc.)
>> and some mis-configured CPU setting.
>>
>>> > Internal error: : 96000010 [#1] SMP
>>> > Modules linked in:
>>> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>> > Hardware name: APM X-Gene Mustang board (DT)
>>> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>> > PC is at pci_generic_config_read32+0x4c/0xb8
>>> > LR is at pci_generic_config_read32+0x40/0xb8
>>> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>> > ...
>>> > Call trace:
>>> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>>
>>> The log shows kernel gets an exception when trying to access Mellanox
>>> card configuration space. This is usually due to suboptimal PCIe
>>> SerDes parameters being used on your board, which will cause bad link
>>> quality.
>>
>> I would have hoped that "suboptimal" means that it still works, albeit
>> not fully optimal ;).
>
> Yes, it should still work, but you may see crashes occasionally due to
> link quality.

A crash seems like a too-severe response to a link quality issue.
Isn't there some way to retry the access or return an error, so we
don't have to crash the whole system?
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-25 0:05 ` Duc Dang
  2015-07-27 11:36 ` Catalin Marinas
@ 2015-07-28 16:43 ` Bjorn Helgaas
  2015-07-28 17:45 ` Duc Dang
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 16:43 UTC
  To: Duc Dang
  Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> Hi Bjorn,
>
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>
>> I regularly see faults like this on an APM X-Gene:
>>
>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> 32 KB ICACHE, 32 KB DCACHE
>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> ...
>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> Internal error: : 96000010 [#1] SMP
>> Modules linked in:
>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> Hardware name: APM X-Gene Mustang board (DT)
>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> PC is at pci_generic_config_read32+0x4c/0xb8
>> LR is at pci_generic_config_read32+0x40/0xb8
>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> ...
>> Call trace:
>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>
> The log shows kernel gets an exception when trying to access Mellanox
> card configuration space. This is usually due to suboptimal PCIe
> SerDes parameters being used on your board, which will cause bad link
> quality.
> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> U-Boot upgrade to our latest X-Gene U-Boot release.

I installed U-Boot 1.15.12, which I thought was the latest. I'm still
seeing this issue regularly, approx once/hour.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 16:43 ` Bjorn Helgaas
@ 2015-07-28 17:45 ` Duc Dang
  2015-07-28 21:29 ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Duc Dang @ 2015-07-28 17:45 UTC
  To: Bjorn Helgaas
  Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
>> Hi Bjorn,
>>
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>
>>> I regularly see faults like this on an APM X-Gene:
>>>
>>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>> 32 KB ICACHE, 32 KB DCACHE
>>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>> ...
>>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>> Internal error: : 96000010 [#1] SMP
>>> Modules linked in:
>>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>> Hardware name: APM X-Gene Mustang board (DT)
>>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>> PC is at pci_generic_config_read32+0x4c/0xb8
>>> LR is at pci_generic_config_read32+0x40/0xb8
>>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>> ...
>>> Call trace:
>>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows kernel gets an exception when trying to access Mellanox
>> card configuration space. This is usually due to suboptimal PCIe
>> SerDes parameters being used on your board, which will cause bad link
>> quality.
>> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> U-Boot upgrade to our latest X-Gene U-Boot release.
>
> I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> seeing this issue regularly, approx once/hour.

Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
version to use. Are you running any PCIe traffic test when the error
happens?

I will try to reproduce the issue with my Mustang board as well.

And it will be useful if you can share your "lspci -vvv" output when
the board is running, we can check to see if there is any error status
reported.

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 17:45 ` Duc Dang
@ 2015-07-28 21:29 ` Bjorn Helgaas
  2015-07-28 21:50 ` Duc Dang
  2016-04-13 9:58 ` Sudeep Holla
  0 siblings, 2 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 21:29 UTC
  To: Duc Dang
  Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> >> Hi Bjorn,
> >>
> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >>>
> >>> I regularly see faults like this on an APM X-Gene:
> >>>
> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >>> 32 KB ICACHE, 32 KB DCACHE
> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >>> ...
> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >>> Internal error: : 96000010 [#1] SMP
> >>> Modules linked in:
> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >>> Hardware name: APM X-Gene Mustang board (DT)
> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >>> PC is at pci_generic_config_read32+0x4c/0xb8
> >>> LR is at pci_generic_config_read32+0x40/0xb8
> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >>> ...
> >>> Call trace:
> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >>
> >> The log shows kernel gets an exception when trying to access Mellanox
> >> card configuration space. This is usually due to suboptimal PCIe
> >> SerDes parameters being used on your board, which will cause bad link
> >> quality.
> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >
> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> > seeing this issue regularly, approx once/hour.
>
> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> version to use. Are you running any PCIe traffic test when the error
> happens?

Nope, the machine was either idle or running a reboot test; no PCIe stress
test or anything.

> And it will be useful if you can share your "lspci -vvv" output when
> the board is running, we can check to see if there is any error status
> reported.

Here's some lspci output and info about the firmware I'm running.
Obviously this lspci output was collected before a crash. I have also
seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.

U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)

CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
    32 KB ICACHE, 32 KB DCACHE
    SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
Boot from SPI-NOR
Slimpro FW:
    Ver: 2.4 (build 01.15.12.00 2015/05/20)
    PMD: 970 mV
    SOC: 950 mV
Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
I2C: ready
DRAM: ECC 32 GiB @ 1600MHz
SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
MMC: X-Gene SD/SDIO/eMMC: 0
PCIE0: (RC) X8 GEN-3 link up
00:00.0 - 10e8:e004 - Bridge device
01:00.0 - 15b3:1007 - Network controller

# lspci -vvv
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
    I/O behind bridge: 0000f000-00000fff
    Memory behind bridge: 80000000-82ffffff
    Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
            ExtTag- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
        LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise+ LLActRep+ BwNot+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Off, PwrInd Off, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
            Changed: MRL- PresDet- LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [80] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [180 v1] #19
    Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Kernel driver in use: pcieport

01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 226
    Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
    Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
    [virtual] Expansion ROM at e183000000 [disabled] [size=1M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [9c] MSI-X: Enable- Count=64 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
    Capabilities: [154 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [18c v1] #19
    Kernel modules: mlx4_core
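[Editor's sketch, not part of the original mail] When checking "lspci -vvv" output for reported error status, it can help to list only the flags that are set ("+") in the DevSta, UESta and CESta lines of each device. The helper below is a minimal illustration that works on captured text. The function name set_status_flags and the SAMPLE excerpt are hypothetical, and the parsing assumes the line layout shown in this thread. Not every set flag is an error (TransPend, for example, only indicates pending transactions).

```python
import re

# Matches the first line of a device block, e.g. "00:00.0 PCI bridge: ..."
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
# Matches the status lines of interest and captures the flag list.
STATUS_RE = re.compile(r"^\s*(DevSta|UESta|CESta):\s*(.*)$")


def set_status_flags(lspci_text):
    """Return {device: {register: [flags set with '+']}} from lspci -vvv text."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            device = match.group(1)
            continue
        match = STATUS_RE.match(line)
        if match and device is not None:
            register, rest = match.groups()
            flags = [tok[:-1] for tok in rest.split() if tok.endswith("+")]
            if flags:
                result.setdefault(device, {})[register] = flags
    return result


# Short excerpt of the output above, used only to exercise the helper.
SAMPLE = """\
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
        DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
        CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
"""

if __name__ == "__main__":
    for dev, registers in set_status_flags(SAMPLE).items():
        print(dev, registers)
```

On the excerpt, the Root Port shows CorrErr, FatalErr and TransPend in DevSta plus RxErr in CESta, while the Mellanox endpoint shows only CorrErr in DevSta.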
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2015-07-28 21:29 ` Bjorn Helgaas @ 2015-07-28 21:50 ` Duc Dang 2015-07-29 1:22 ` Bjorn Helgaas 2016-04-13 9:58 ` Sudeep Holla 1 sibling, 1 reply; 24+ messages in thread From: Duc Dang @ 2015-07-28 21:50 UTC (permalink / raw) To: Bjorn Helgaas Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote: >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote: >> >> Hi Bjorn, >> >> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> >>> >> >>> I regularly see faults like this on an APM X-Gene: >> >>> >> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33) >> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz >> >>> 32 KB ICACHE, 32 KB DCACHE >> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz >> >>> ... >> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034 >> >>> Internal error: : 96000010 [#1] SMP >> >>> Modules linked in: >> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3 >> >>> Hardware name: APM X-Gene Mustang board (DT) >> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000 >> >>> PC is at pci_generic_config_read32+0x4c/0xb8 >> >>> LR is at pci_generic_config_read32+0x40/0xb8 >> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5 >> >>> ... 
>> >>> Call trace: >> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8 >> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4 >> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238 >> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0 >> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac >> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128 >> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144 >> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0 >> >> >> >> The log shows kernel gets an exception when trying to access Mellanox >> >> card configuration space. This is usually due to suboptimal PCIe >> >> SerDes parameters are using in your board, which will cause bad link >> >> quality. >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a >> >> U-Boot upgrade to our latest X-Gene U-Boot release. >> > >> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still >> > seeing this issue regularly, approx once/hour. >> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good >> version to use. Are you running any PCIe traffic test when the error >> happens? > > Nope, the machine was either idle or running a reboot test; no PCIe stress > test or anything. > >> And it will be useful if you can share your "lspci -vvv" output when >> the board is running, we can check to see if there is any error status >> reported. > > Here's some lspci output and info about the firmware I'm running. > Obviously this lspci output was collected before a crash. I have also > seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port. 
>
> U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
>
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> 32 KB ICACHE, 32 KB DCACHE
> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> Boot from SPI-NOR
> Slimpro FW:
> Ver: 2.4 (build 01.15.12.00 2015/05/20)
> PMD: 970 mV
> SOC: 950 mV
> Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> I2C: ready
> DRAM: ECC 32 GiB @ 1600MHz
> SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> MMC: X-Gene SD/SDIO/eMMC: 0
> PCIE0: (RC) X8 GEN-3 link up
> 00:00.0 - 10e8:e004 - Bridge device
> 01:00.0 - 15b3:1007 - Network controller
>
> # lspci -vvv
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>     Latency: 0
>     Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>     I/O behind bridge: 0000f000-00000fff
>     Memory behind bridge: 80000000-82ffffff
>     Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
>     Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>     BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>         PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>     Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>         DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
>             ExtTag- RBE+ FLReset-
>         DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>             RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>             MaxPayload 256 bytes, MaxReadReq 512 bytes
>         DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
>         LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
>             ClockPM- Surprise+ LLActRep+ BwNot+
>         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>         LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>         SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>             Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
>         SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>             Control: AttnInd Off, PwrInd Off, Power- Interlock-
>         SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>             Changed: MRL- PresDet- LinkState+
>         RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
>         RootCap: CRSVisible-
>         RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>         DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
>         LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB

Target Link Speed unknown is really strange. I also saw the same "Link
speed unknown" for Mellanox card below.

>             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>             Compliance De-emphasis: -6dB
>         LnkSta2: Current De-emphasis Level: -6dB
>     Capabilities: [80] Power Management version 3
>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>     Capabilities: [100 v1] Advanced Error Reporting
>         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>         UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>         CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>         CEMsk: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>         AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>     Capabilities: [180 v1] #19
>     Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>     Kernel driver in use: pcieport
>
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>     Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

Mem and BusMaster are disabled. So this card is not functional?

>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>     Interrupt: pin A routed to IRQ 226
>     Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
>     Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
>     [virtual] Expansion ROM at e183000000 [disabled] [size=1M]
>     Capabilities: [40] Power Management version 3
>         Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>     Capabilities: [9c] MSI-X: Enable- Count=64 Masked-

This may be unrelated, but MSI allocation fails for this card somehow.

>         Vector table: BAR=0 offset=0007c000
>         PBA: BAR=0 offset=0007d000
>     Capabilities: [60] Express (v2) Endpoint, MSI 00
>         DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>             MaxPayload 128 bytes, MaxReadReq 512 bytes
>         DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>         LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
>             ClockPM- Surprise- LLActRep- BwNot-
>         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>         LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>         LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
>             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>             Compliance De-emphasis: -6dB
>         LnkSta2: Current De-emphasis Level: -6dB
>     Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
>         ARICap: MFVC- ACS-, Next Function: 0
>         ARICtl: MFVC- ACS-, Function Group: 0
>     Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx

The serial number here seems invalid. I have a Mellanox card but
different model (ConnectX-3 15b3:1003) that shows meaningful serial
number:
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

Do you have another PCIe card to try on the same reboot test on this board?

>     Capabilities: [154 v2] Advanced Error Reporting
>         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>         UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>         CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>         AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>     Capabilities: [18c v1] #19
>     Kernel modules: mlx4_core

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Bjorn Helgaas @ 2015-07-29 1:22 UTC
To: Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > # lspci -vvv
> > 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
> >         LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
>
> Target Link Speed unknown is really strange. I also saw the same "Link
> speed unknown" for Mellanox card below.

I think this is because I have a really old lspci.
Here's the -xxx output:

  00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
  10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
  20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
  30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
  40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
  50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
  60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
  70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
  80: 01 00 03 06 08 00 00 00 00 00 00 00 00 00 00 00
  90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

LnkCtl2 is at offset 0x30 in the PCIe capability, which starts at 0x40,
so LnkCtl2 = 0x0043. I think that means Target Link Speed is 0x3, or
"Supported Link Speeds Vector field bit 2". The Supported Link Speeds
Vector in LnkCap2 (which isn't decoded even by current upstream lspci)
is 0x7, so 2.5GT/s, 5.0GT/s, and 8.0GT/s are all supported, with bit 2
being 8.0GT/s. So I think a modern lspci would show "8.0GT/s".

> > 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> >         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>
> Mem and BusMaster are disabled. So this card is not functional?

I don't know whether it's functional; I haven't tried to use it yet. I
typically don't even load the mlx4 driver, so most of the failures I'm
seeing are when the driver isn't loaded. User-space code is doing
config reads via /sys.

> >         Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
>
> The serial number here seems invalid. I have a Mellanox card but
> different model (ConnectX-3 15b3:1003) that shows meaningful serial
> number:
> Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

My fault, lspci actually showed a meaningful serial number; I removed it
in a misguided attempt to avoid exposing anything proprietary.

> Do you have another PCIe card to try on the same reboot test on this board?

I've seen this on at least two Mellanox cards. I'm running similar tests
on a different type of card now.

Bjorn
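
The hand decoding of LnkCtl2 in the message above can also be done mechanically. What follows is a minimal Python sketch, not lspci's implementation: the helper names (parse_dump, read_le, find_capability, decode_target_link_speed) are made up for illustration, and only the rows of the 00:00.0 -xxx dump that the decode touches are copied in. The offsets are the standard ones: capability pointer at 0x34, PCI Express capability ID 0x10, Link Capabilities 2 at 0x2c and Link Control 2 at 0x30 within that capability.

```python
# Sketch: decode Target Link Speed from an "lspci -xxx" config-space dump.
# Helper names are hypothetical; rows are copied from the 00:00.0 dump above.

DUMP = """\
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

PCI_CAPABILITY_LIST = 0x34  # config offset of the first-capability pointer
PCI_CAP_ID_EXP = 0x10       # PCI Express capability ID
PCI_EXP_LNKCAP2 = 0x2C      # Link Capabilities 2, relative to the capability
PCI_EXP_LNKCTL2 = 0x30      # Link Control 2, relative to the capability

# Supported Link Speeds Vector bit number -> link speed
SPEEDS = {0: "2.5GT/s", 1: "5.0GT/s", 2: "8.0GT/s"}


def parse_dump(text):
    """Turn 'addr: b0 b1 ...' rows into a 256-byte config-space image."""
    cfg = bytearray(256)
    for line in text.splitlines():
        if not line.strip():
            continue
        addr, _, data = line.partition(":")
        base = int(addr, 16)
        for i, byte in enumerate(data.split()):
            cfg[base + i] = int(byte, 16)
    return cfg


def read_le(cfg, offset, size):
    """Read a little-endian register of `size` bytes."""
    return int.from_bytes(cfg[offset:offset + size], "little")


def find_capability(cfg, cap_id):
    """Walk the capability list; return the offset of cap_id or None."""
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    while pos and pos not in seen:  # `seen` guards against a looping list
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def decode_target_link_speed(cfg):
    exp = find_capability(cfg, PCI_CAP_ID_EXP)
    if exp is None:
        raise ValueError("no PCI Express capability found")
    lnkctl2 = read_le(cfg, exp + PCI_EXP_LNKCTL2, 2)
    lnkcap2 = read_le(cfg, exp + PCI_EXP_LNKCAP2, 4)
    target = lnkctl2 & 0xF            # Target Link Speed, bits 3:0
    vector = (lnkcap2 >> 1) & 0x7F    # Supported Link Speeds Vector, bits 7:1
    return {
        "cap_offset": exp,
        "lnkctl2": lnkctl2,
        "target": target,
        "vector": vector,
        "supported": [SPEEDS[b] for b in sorted(SPEEDS) if vector & (1 << b)],
        # Target value n refers to vector bit n - 1.
        "target_speed": SPEEDS.get(target - 1, "unknown"),
    }


if __name__ == "__main__":
    info = decode_target_link_speed(parse_dump(DUMP))
    print("PCIe capability at 0x%02x" % info["cap_offset"])
    print("LnkCtl2 = 0x%04x, Target Link Speed = 0x%x (%s)"
          % (info["lnkctl2"], info["target"], info["target_speed"]))
    print("Supported Link Speeds Vector = 0x%x: %s"
          % (info["vector"], ", ".join(info["supported"])))
```

Run against these rows, the sketch arrives at the same values worked out by hand in the message: capability at 0x40, LnkCtl2 = 0x0043, Target Link Speed 0x3, and a Supported Link Speeds Vector of 0x7, i.e. 8.0GT/s.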
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Bjorn Helgaas @ 2015-07-29 15:55 UTC
To: Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> > Do you have another PCIe card to try on the same reboot test on this board?
>
> I've seen this on at least two Mellanox cards. I'm running similar tests
> on a different type of card now.

FWIW, reboot tests on two machines with Mellanox cards failed, while the
same test on a machine with a different proprietary card succeeded.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Duc Dang @ 2015-07-31 17:00 UTC
To: Bjorn Helgaas
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>
>> > Do you have another PCIe card to try on the same reboot test on this board?
>>
>> I've seen this on at least two Mellanox cards. I'm running similar tests
>> on a different type of card now.
>
> FWIW, reboot tests on two machines with Mellanox cards failed, while the
> same test on a machine with a different proprietary card succeeded.

Thanks, Bjorn.

I don't have the same Mellanox card as yours, but I will also run a
similar reboot test to see if I hit the same issue with my card.

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Bjorn Helgaas @ 2015-08-10 16:18 UTC
To: Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>
>>> > Do you have another PCIe card to try on the same reboot test on this board?
>>>
>>> I've seen this on at least two Mellanox cards. I'm running similar tests
>>> on a different type of card now.
>>
>> FWIW, reboot tests on two machines with Mellanox cards failed, while the
>> same test on a machine with a different proprietary card succeeded.
>
> Thanks, Bjorn.
>
> I don't have the same Mellanox card as yours, but I will also run a
> similar reboot test to see if I hit the same issue with my card.

Any more hints on this? Nothing has changed on my end, so of course
I'm still seeing this, always on machines with Mellanox, and never on
other machines. Could this be a hardware issue like a signal
integrity or margin issue? I don't know where to go from here because
I'm not a hardware person, and I don't know anything to do in
software.

Bjorn
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Catalin Marinas @ 2015-08-10 17:38 UTC
To: Bjorn Helgaas
Cc: Duc Dang, linux-pci@vger.kernel.org, Tanmay Inamdar, linux-arm, linux-kernel@vger.kernel.org

On Mon, Aug 10, 2015 at 11:18:23AM -0500, Bjorn Helgaas wrote:
> Any more hints on this? Nothing has changed on my end, so of course
> I'm still seeing this, always on machines with Mellanox, and never on
> other machines. Could this be a hardware issue like a signal
> integrity or margin issue? I don't know where to go from here because
> I'm not a hardware person, and I don't know anything to do in
> software.

Silly hack below, not actually a solution (and it may not even work):

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 94d98cd1aad8..e895e96b3d13 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -369,6 +369,14 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * Retry the faulty access.
+ */
+static int do_good(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	return 0;
+}
+
 static struct fault_info {
 	int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 	int sig;
@@ -391,7 +399,7 @@ static struct fault_info {
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
-	{ do_bad, SIGBUS, 0, "synchronous external abort" },
+	{ do_good, SIGBUS, 0, "synchronous external abort" },
 	{ do_bad, SIGBUS, 0, "asynchronous external abort" },
 	{ do_bad, SIGBUS, 0, "unknown 18" },
 	{ do_bad, SIGBUS, 0, "unknown 19" },

--
Catalin
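
For context on which table entry the hack touches: the 0x96000010 printed in the oops is the arm64 exception syndrome (ESR) value, and the fault handler selects a fault_info entry using its low six bits. The sketch below, written in Python purely for illustration, splits that value into its architectural fields and looks up the entry name. decode_esr is a made-up helper, not a kernel function, and the FAULT_NAMES slice is copied from the diff context above, with indices 13 to 19 inferred from the "unknown 18" and "unknown 19" entries.

```python
# Sketch: split the ESR value from the oops into fields and find the
# fault_info entry it selects. decode_esr() is a hypothetical helper.

ESR_FROM_OOPS = 0x96000010

# Slice of the fault_info names visible in the diff context; the index
# of each entry is inferred from the "unknown 18"/"unknown 19" rows.
FAULT_NAMES = {
    13: "level 1 permission fault",
    14: "level 2 permission fault",
    15: "level 3 permission fault",
    16: "synchronous external abort",
    17: "asynchronous external abort",
    18: "unknown 18",
    19: "unknown 19",
}


def decode_esr(esr):
    """Return the ESR fields relevant to a data abort."""
    fsc = esr & 0x3F                 # fault status code, bits 5:0
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class, bits 31:26
        "il": (esr >> 25) & 1,       # instruction length bit
        "wnr": (esr >> 6) & 1,       # write-not-read: 0 = read, 1 = write
        "fsc": fsc,
        "name": FAULT_NAMES.get(fsc, "not in this slice"),
    }


if __name__ == "__main__":
    fields = decode_esr(ESR_FROM_OOPS)
    print("EC=0x%02x IL=%d WnR=%d FSC=0x%02x -> %s"
          % (fields["ec"], fields["il"], fields["wnr"],
             fields["fsc"], fields["name"]))
```

For 0x96000010 this gives EC 0x25 (a data abort taken from the same exception level, i.e. in the kernel), WnR 0 (a read, consistent with a config read), and fault status 0x10, which indexes the "synchronous external abort" entry that the patch switches from do_bad to do_good.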
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Bjorn Helgaas @ 2015-08-10 17:42 UTC
To: Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> Any more hints on this? Nothing has changed on my end, so of course
>> I'm still seeing this, always on machines with Mellanox, and never on
>> other machines. Could this be a hardware issue like a signal
>> integrity or margin issue? I don't know where to go from here because
>> I'm not a hardware person, and I don't know anything to do in
>> software.
>
> Hi Bjorn,
>
> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
> family, one card has 2 10G interfaces, the other one has 1 port that
> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
> the crash that you encountered.
>
> Did you check if your Mellanox cards have the latest firmware? I did see some
> link issues on my Mellanox cards with their old firmware before.

Good idea; I'll check that, too. Also, I just learned that these
cards are installed with an extender card because of some space issues,
so we're going to test again without the extender.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Duc Dang @ 2015-08-10 19:07 UTC
To: Bjorn Helgaas
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>>
>> Hi Bjorn,
>>
>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>> family, one card has 2 10G interfaces, the other one has 1 port that
>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>> the crash that you encountered.
>>
>> Did you check if your Mellanox cards have the latest firmware? I did see some
>> link issues on my Mellanox cards with their old firmware before.
>
> Good idea; I'll check that, too. Also, I just learned that these
> cards are installed with an extender card because of some space issues,
> so we're going to test again without the extender.

Hi Bjorn,

Are the other cards that passed your test installed directly in the
on-board PCIe slot? If yes, then this is a good data point, and it will
be useful to test the case where your Mellanox cards are installed
directly in the on-board PCIe slot.

--
Regards,
Duc Dang.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Bjorn Helgaas @ 2015-08-11 19:28 UTC
To: Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote:
> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>>>
>>> Hi Bjorn,
>>>
>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>> the crash that you encountered.
>>>
>>> Did you check if your Mellanox cards have the latest firmware? I did see some
>>> link issues on my Mellanox cards with their old firmware before.
>>
>> Good idea; I'll check that, too. Also, I just learned that these
>> cards are installed with an extender card because of some space issues,
>> so we're going to test again without the extender.
>
> Hi Bjorn,
>
> Are the other cards that passed your test installed directly in the
> on-board PCIe slot? If yes, then this is a good data point, and it will
> be useful to test the case where your Mellanox cards are installed
> directly in the on-board PCIe slot.

The cards that passed the test were installed directly, with no
extender. We removed the extender from one of the machines with the
Mellanox card and have not seen this issue since then. I think it's
very likely that the problem is related to using the extender.

Bjorn
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
From: Jon Masters @ 2015-09-05 20:13 UTC
To: Bjorn Helgaas, Duc Dang
Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org

On 08/11/2015 03:28 PM, Bjorn Helgaas wrote:
> On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote:
>> Hi Bjorn,
>>
>> Are other cards that passed your test installed directly to the
>> on-board PCIe slot?
>> If yes, then this is a good data point and it will be useful to test
>> the case where
>> your Mellanox cards are directly installed into the on-board PCIe slot.
>
> The cards that passed the test were installed directly, with no
> extender. We removed the extender from one of the machines with the
> Mellanox card and have not seen this issue since then. I think it's
> very likely that the problem is related to using the extender.

If you're trying to use Mellanox cards in (for example) an APM Mustang
like system with a PCIe extender card (for example a 90 degree angle
adjustment for a low profile server case), you might want to ping me
offline. I have procured a number of these over the past couple of years
for my home lab and have found one that works (almost) reliably on that
particular hardware platform and does 10G in my home lab.

Jon.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2015-09-05 20:13 ` Jon Masters @ 2015-09-05 20:22 ` Jon Masters 0 siblings, 0 replies; 24+ messages in thread From: Jon Masters @ 2015-09-05 20:22 UTC (permalink / raw) To: Bjorn Helgaas, Duc Dang Cc: Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org On 09/05/2015 04:13 PM, Jon Masters wrote: > On 08/11/2015 03:28 PM, Bjorn Helgaas wrote: >> On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote: >>> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote: >>>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote: >>>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote: >>>>>> >>>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote: >>>>>>> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> >>>>>>> wrote: >>>>>>>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote: >>>>>>>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote: >>>>>>>> >>>>>>>>>> Do you have another PCIe card to try on the same reboot test on this >>>>>>>>>> board? >>>>>>>>> >>>>>>>>> I've seen this on at least two Mellanox cards. I'm running similar >>>>>>>>> tests >>>>>>>>> on a different type of card now. >>>>>>>> >>>>>>>> FWIW, reboot tests on two machines with Mellanox cards failed, while >>>>>>>> the >>>>>>>> same test on a machine with a different proprietary card succeeded. >>>>>>> >>>>>>> Thanks, Bjorn. >>>>>>> >>>>>>> I don't have the same Mellanox card as yours, but I will also run >>>>>>> similar reboot test to see if I hit the same issue with my card. >>>>>> >>>>>> Any more hints on this? Nothing has changed on my end, so of course >>>>>> I'm still seeing this, always on machines with Mellanox, and never on >>>>>> other machines. Could this be a hardware issue like a signal >>>>>> integrity or margin issue? 
I don't know where to go from here because >>>>>> I'm not a hardware person, and I don't know anything to do in >>>>>> software. >>>>> >>>>> >>>>> Hi Bjorn, >>>>> >>>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X >>>>> family, one card has 2 10G interfaces, the other one has 1 port that >>>>> supports InfiniBand) with U-Boot 1.15.12 and Linux 4.2-rc5 and I did not see >>>>> the crash that you encountered. >>>>> >>>>> Did you check if your Mellanox cards have the latest firmware? I did see some >>>>> link issues on my Mellanox cards with their old firmware before. >>>> >>>> Good idea; I'll check that, too. Also, I just learned that these >>>> cards are installed with an extender card because of some space issues, >>>> so we're going to test again without the extender. >>> >>> Hi Bjorn, >>> >>> Are other cards that passed your test installed directly to the >>> on-board PCIe slot? >>> If yes, then this is a good data point and it will be useful to test >>> the case where >>> your Mellanox cards are directly installed into the on-board PCIe slot. >> >> The cards that passed the test were installed directly, with no >> extender. We removed the extender from one of the machines with the >> Mellanox card and have not seen this issue since then. I think it's >> very likely that the problem is related to using the extender. > > If you're trying to use Mellanox cards in (for example) an APM Mustang-like > system with a PCIe extender card (for example a 90-degree angle > adjustment for a low-profile server case), you might want to ping me > offline. I have procured a number of these over the past couple of years > for my home lab and have found one that works (almost) reliably on that > particular hardware platform and does 10G there. I'm traveling for the holiday, but I guess it doesn't need to be a secret. 
I think I have had some success with this one (but I have ordered many different ones over the past year, so I will confirm next week): http://www.amazon.com/gp/product/B00H8VVD00?psc=1&redirect=true&ref_=oh_aui_search_detailpage Note that the fixed-angle adapter brackets generally DO NOT work. Jon.
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2015-07-28 21:29 ` Bjorn Helgaas 2015-07-28 21:50 ` Duc Dang @ 2016-04-13 9:58 ` Sudeep Holla 2016-04-13 13:21 ` Bjorn Helgaas 1 sibling, 1 reply; 24+ messages in thread From: Sudeep Holla @ 2016-04-13 9:58 UTC (permalink / raw) To: Bjorn Helgaas Cc: Duc Dang, Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org, Sudeep Holla Hi, (sorry for replying on the old thread, but I found it could be related to the issue I have now) On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote: >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote: >> >> Hi Bjorn, >> >> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> >>> >> >>> I regularly see faults like this on an APM X-Gene: >> >>> >> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33) >> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz >> >>> 32 KB ICACHE, 32 KB DCACHE >> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz >> >>> ... >> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034 >> >>> Internal error: : 96000010 [#1] SMP >> >>> Modules linked in: >> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3 >> >>> Hardware name: APM X-Gene Mustang board (DT) >> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000 >> >>> PC is at pci_generic_config_read32+0x4c/0xb8 >> >>> LR is at pci_generic_config_read32+0x40/0xb8 >> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5 >> >>> ... 
>> >>> Call trace: >> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8 >> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4 >> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238 >> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0 >> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac >> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128 >> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144 >> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0 >> >> >> >> The log shows the kernel gets an exception when trying to access the Mellanox >> >> card's configuration space. This is usually due to suboptimal PCIe >> >> SerDes parameters being used on your board, which will cause bad link >> >> quality. >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a >> >> U-Boot upgrade to our latest X-Gene U-Boot release. >> > >> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still >> > seeing this issue regularly, approx once/hour. >> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good >> version to use. Are you running any PCIe traffic test when the error >> happens? > > Nope, the machine was either idle or running a reboot test; no PCIe stress > test or anything. > Was there any conclusion on this? I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot. Regards, Sudeep [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2016-04-13 9:58 ` Sudeep Holla @ 2016-04-13 13:21 ` Bjorn Helgaas 2016-04-13 13:29 ` Sudeep Holla 0 siblings, 1 reply; 24+ messages in thread From: Bjorn Helgaas @ 2016-04-13 13:21 UTC (permalink / raw) To: Sudeep Holla Cc: Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote: > Hi, > > (sorry for replying on the old thread, but I found it could be related > to the issue > I have now) > > On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote: > >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote: > >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote: > >> >> Hi Bjorn, > >> >> > >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > >> >>> > >> >>> I regularly see faults like this on an APM X-Gene: > >> >>> > >> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33) > >> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz > >> >>> 32 KB ICACHE, 32 KB DCACHE > >> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz > >> >>> ... > >> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034 > >> >>> Internal error: : 96000010 [#1] SMP > >> >>> Modules linked in: > >> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3 > >> >>> Hardware name: APM X-Gene Mustang board (DT) > >> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000 > >> >>> PC is at pci_generic_config_read32+0x4c/0xb8 > >> >>> LR is at pci_generic_config_read32+0x40/0xb8 > >> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5 > >> >>> ... 
> >> >>> Call trace: > >> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8 > >> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4 > >> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238 > >> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0 > >> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac > >> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128 > >> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144 > >> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0 > >> >> > >> >> The log shows the kernel gets an exception when trying to access the Mellanox > >> >> card's configuration space. This is usually due to suboptimal PCIe > >> >> SerDes parameters being used on your board, which will cause bad link > >> >> quality. > >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a > >> >> U-Boot upgrade to our latest X-Gene U-Boot release. > >> > > >> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still > >> > seeing this issue regularly, approx once/hour. > >> > >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good > >> version to use. Are you running any PCIe traffic test when the error > >> happens? > > > > Nope, the machine was either idle or running a reboot test; no PCIe stress > > test or anything. > > > > Was there any conclusion on this? > I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot. We found that the unhandled faults occurred when using an extender card. After removing the extender card, we didn't see the faults any more. > [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2016-04-13 13:21 ` Bjorn Helgaas @ 2016-04-13 13:29 ` Sudeep Holla 2016-04-13 22:17 ` Jon Masters 0 siblings, 1 reply; 24+ messages in thread From: Sudeep Holla @ 2016-04-13 13:29 UTC (permalink / raw) To: Bjorn Helgaas Cc: Sudeep Holla, Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org On 13/04/16 14:21, Bjorn Helgaas wrote: > On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote: >> Hi, >> >> (sorry for replying on the old thread, but I found it could be related >> to the issue >> I have now) >> >> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: [...] >>>> >>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good >>>> version to use. Are you running any PCIe traffic test when the error >>>> happens? >>> >>> Nope, the machine was either idle or running a reboot test; no PCIe stress >>> test or anything. >>> >> >> Was there any conclusion on this? >> I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot. > > We found that the unhandled faults occurred when using an extender > card. After removing the extender card, we didn't see the faults any > more. > Thanks for the response. It's not related then; I saw the report referencing reboot tests and hence linked them together. Sorry for the noise. -- Regards, Sudeep >> [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2016-04-13 13:29 ` Sudeep Holla @ 2016-04-13 22:17 ` Jon Masters 0 siblings, 0 replies; 24+ messages in thread From: Jon Masters @ 2016-04-13 22:17 UTC (permalink / raw) To: Sudeep Holla, Bjorn Helgaas Cc: Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci@vger.kernel.org, linux-arm, linux-kernel@vger.kernel.org On 04/13/2016 09:29 AM, Sudeep Holla wrote: > > > On 13/04/16 14:21, Bjorn Helgaas wrote: >> On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote: >>> Hi, >>> >>> (sorry for replying on the old thread, but I found it could be related >>> to the issue >>> I have now) >>> >>> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> >>> wrote: > > [...] > >>>>> >>>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good >>>>> version to use. Are you running any PCIe traffic test when the error >>>>> happens? >>>> >>>> Nope, the machine was either idle or running a reboot test; no PCIe >>>> stress >>>> test or anything. >>>> >>> >>> Was there any conclusion on this? >>> I am having a similar issue [1] on my Juno with the sky2 PCIe driver during >>> reboot. >> >> We found that the unhandled faults occurred when using an extender >> card. After removing the extender card, we didn't see the faults any >> more. >> > > Thanks for the response. It's not related then; I saw the report referencing > reboot tests and hence linked them together. Sorry for the noise. For the record, I've had success with this cable on X-Gene: http://www.amazon.com/PCI-E-Riser-Flexible-Ribbon-Extension/dp/B00H8VVD00?ie=UTF8&psc=1&redirect=true&ref_=oh_aui_search_detailpage But it's hit or miss. The only public platform where I've been reliably able to use an extender cable so far is AMD Seattle. On that platform, the PCIe IP is so rock solid that I can talk to very funky PCIe IP I've implemented myself in an FPGA (and I can see link quality is fine too). 
There's one other non-public platform so far where PCIe extenders work without a single hitch as well, and a number where more work is needed. Jon. -- Computer Architect
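One coarse way to look at link health from software is to compare the negotiated PCIe link speed and width against the maximum the device advertises. The following is only a sketch under stated assumptions: it relies on the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` sysfs attributes, which exist on kernels newer than the v4.1 discussed in this thread, and the helper names `read_attr` and `link_is_degraded` are made up for illustration. A marginal link can still train at full speed and width, so a clean result here does not rule out signal integrity problems.

```python
from pathlib import Path


def read_attr(dev_dir, name):
    """Return one sysfs attribute of a PCI device as a stripped string."""
    return (Path(dev_dir) / name).read_text().strip()


def link_is_degraded(dev_dir):
    """Return True if the link trained below the device's advertised maximum.

    dev_dir is a PCI device directory such as
    /sys/bus/pci/devices/0000:01:00.0 (the 0000 domain is an assumption).
    """
    cur_speed = read_attr(dev_dir, "current_link_speed")
    max_speed = read_attr(dev_dir, "max_link_speed")
    cur_width = int(read_attr(dev_dir, "current_link_width"))
    max_width = int(read_attr(dev_dir, "max_link_width"))
    # Speeds are compared as strings because their exact text format
    # differs between kernel versions; widths are plain lane counts.
    return cur_speed != max_speed or cur_width < max_width
```

For example, `link_is_degraded("/sys/bus/pci/devices/0000:01:00.0")` would check the device that `lspci` lists at 01:00.0, assuming PCI domain 0000.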
* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 2015-07-24 22:42 X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32 Bjorn Helgaas 2015-07-25 0:05 ` Duc Dang @ 2015-07-28 14:37 ` Dall, Elizabeth J 1 sibling, 0 replies; 24+ messages in thread From: Dall, Elizabeth J @ 2015-07-28 14:37 UTC (permalink / raw) To: Bjorn Helgaas, Tanmay Inamdar Cc: Duc Dang, linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org On 07/24/2015 04:43 PM, Bjorn Helgaas wrote: > I regularly see faults like this on an APM X-Gene: > > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33) > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz > 32 KB ICACHE, 32 KB DCACHE > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz > ... > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034 The 0x96000010 is the value of the ESR register. Its exception class (EC, bits [31:26]) is 0x25, which decodes to "Data Abort taken without a change in Exception level". The ISS field carries the fault status code 0x10, a synchronous external abort that is not on a translation table walk, which matches the kernel message. -Betty Dall
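As a cross-check on the ESR value in the fault message, here is a minimal sketch (not kernel code) that splits 0x96000010 into its fields using the ARMv8-A ESR_ELx layout: EC in bits [31:26], IL in bit 25, ISS in bits [24:0]. The function name `decode_esr` and the two lookup tables are made up for illustration and name only a few encodings; anything else is reported as unknown.

```python
# Only the encodings relevant to this thread are named here.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
    0x26: "SP alignment fault",
}
DFSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
}


def decode_esr(esr):
    """Split a 32-bit ESR_ELx value into EC, IL and ISS fields."""
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    fields = {
        "ec": ec,
        "il": il,
        "iss": iss,
        "ec_name": EC_NAMES.get(ec, "unknown (not covered by this sketch)"),
    }
    if ec in (0x24, 0x25):
        # For data aborts the low ISS bits carry the fault status code
        # and the write-not-read flag.
        dfsc = iss & 0x3F
        fields["dfsc"] = dfsc
        fields["dfsc_name"] = DFSC_NAMES.get(
            dfsc, "unknown (not covered by this sketch)"
        )
        fields["wnr"] = (iss >> 6) & 0x1  # 0 = read, 1 = write
    return fields


if __name__ == "__main__":
    for key, value in decode_esr(0x96000010).items():
        print(key, "=", hex(value) if isinstance(value, int) else value)
```

For 0x96000010 this yields EC 0x25, IL 1 and ISS 0x10, with the write-not-read bit clear, which is consistent with a faulting read in `pci_generic_config_read32`.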