Linux PCI subsystem development
From: "Pali Rohár" <pali@kernel.org>
To: Lukas Wunner <lukas@wunner.de>
Cc: "Maciej W. Rozycki" <macro@orcam.me.uk>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	Stefan Roese <sr@denx.de>, Jim Wilson <wilson@tuliptree.org>,
	David Abdurachmanov <david.abdurachmanov@gmail.com>,
	linux-pci@vger.kernel.org,
	Philipp Rosenberger <p.rosenberger@kunbus.com>
Subject: Re: [PING][PATCH v6 0/7] pci: Work around ASMedia ASM2824 PCIe link training failures
Date: Wed, 22 Feb 2023 10:17:02 +0100
Message-ID: <20230222091702.yd7tpkb2kj7b75da@pali>
In-Reply-To: <20230222084033.GA31047@wunner.de>

On Wednesday 22 February 2023 09:40:33 Lukas Wunner wrote:
> On Tue, Feb 21, 2023 at 10:46:11PM +0100, Pali Rohár wrote:
> > On Sunday 19 February 2023 20:46:19 Lukas Wunner wrote:
> > > > On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:
> > > > >  This is v6 of the change to work around a PCIe link training phenomenon
> > > > > where a pair of devices both capable of operating at a link speed above
> > > > > 2.5GT/s seems unable to negotiate the link speed and continues training
> > > > > indefinitely with the Link Training bit switching on and off repeatedly
> > > > > and the data link layer never reaching the active state.
> > > 
> > > Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
> > > switch below a Broadcom STB host controller:  On some rare occasions,
> > > when booting the system the link trains correctly at 5 GT/s and the
> > > switch is accessible without any issues.  But most of the time,
> > > the switch is inaccessible on boot.  The Broadcom STB host controller
> > > claims not to support Link Active Reporting, but in reality has a
> > > link status indicator in a custom register.  It indicates that the
> > > link is up, the link trains to 2.5 GT/s but the switch is inaccessible.
> > 
> > This is interesting. Do you know which layer it indicates is up?
> > I can imagine that the PCIe physical layer or data link layer is up
> > but the PCIe transaction layer is not up yet, and so sending config
> > requests fails. The existence of a custom register may explain that
> > it indicates a different meaning of "link up".
> 
> drivers/pci/controller/pcie-brcmstb.c defines the following bits:
> 
> #define  PCIE_MISC_PCIE_STATUS_PCIE_PORT_MASK           0x80
> #define  PCIE_MISC_PCIE_STATUS_PCIE_DL_ACTIVE_MASK      0x20
> #define  PCIE_MISC_PCIE_STATUS_PCIE_PHYLINKUP_MASK      0x10
> #define  PCIE_MISC_PCIE_STATUS_PCIE_LINK_IN_L23_MASK    0x40
> 
> And brcm_pcie_link_up() checks that both DL_ACTIVE and PHYLINKUP are set.
> 
> A public spec for the Broadcom STB PCIe controller does not seem to exist,
> so I do not know what the register bits mean exactly.

OK, so this is a question for Broadcom. Without a spec we probably
cannot do anything.
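
For reference, the check described above can be sketched as a standalone
function. This is only a minimal sketch: the mask values are the ones
quoted above, but the helper name status_link_up() is hypothetical, and
the real driver reads the status from a controller register instead of
taking it as an argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks as quoted above from drivers/pci/controller/pcie-brcmstb.c. */
#define PCIE_MISC_PCIE_STATUS_PCIE_PORT_MASK           0x80
#define PCIE_MISC_PCIE_STATUS_PCIE_DL_ACTIVE_MASK      0x20
#define PCIE_MISC_PCIE_STATUS_PCIE_PHYLINKUP_MASK      0x10
#define PCIE_MISC_PCIE_STATUS_PCIE_LINK_IN_L23_MASK    0x40

/*
 * Hypothetical helper (not the driver's function): report "link up"
 * only when both the data link layer active bit and the PHY link up
 * bit are set in the given status value.
 */
bool status_link_up(uint32_t status)
{
	bool dl_active = (status & PCIE_MISC_PCIE_STATUS_PCIE_DL_ACTIVE_MASK) != 0;
	bool phy_up = (status & PCIE_MISC_PCIE_STATUS_PCIE_PHYLINKUP_MASK) != 0;

	return dl_active && phy_up;
}
```

Note that such a check says nothing about the transaction layer, so a
"link up" result here does not guarantee that config requests succeed.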

> 
> > > Due to a quirk of the Broadcom STB host controller, ECAM access to
> > > the inaccessible switch raises an unhandled CPU exception and thus
> > > causes a kernel panic, making the issue difficult to debug.
> > 
> > Is this an ARM Cortex-A53 core, and is the unhandled exception an
> > asynchronous one with syndrome 0xbf000002?
> 
> It's a Cortex A72 and yes the exception looks like this:
> 
> SError Interrupt on CPU1, code 0x00000000bf000002 -- SError

This is a core-specific exception code; on A53 it means AXI Slave Error.
I guess that on A72 it could mean the same. But in general these codes
are not defined for AArch64.
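
To illustrate why the code is core specific: in the architectural
ESR_ELx layout the exception class is bits [31:26], IL is bit 25 and the
ISS is bits [24:0]. For SError the top ISS bit (IDS) says that the rest
of the syndrome is implementation defined. A minimal sketch of that
decoding follows; the struct and function names are made up for
illustration, and the meaning of the low bits (AXI Slave Error on A53)
comes from the core's documentation, not from this code.

```c
#include <stdint.h>

/* Hypothetical container for the fields of an SError syndrome. */
struct serror_syndrome {
	unsigned int ec;	/* exception class, ESR bits [31:26] */
	unsigned int il;	/* instruction length bit, ESR bit 25 */
	unsigned int ids;	/* implementation defined syndrome, ESR bit 24 */
	uint32_t impl;		/* remaining ISS bits [23:0] */
};

/* Split a raw ESR value into the fields above. */
struct serror_syndrome decode_serror(uint64_t esr)
{
	struct serror_syndrome s;

	s.ec = (unsigned int)((esr >> 26) & 0x3f);
	s.il = (unsigned int)((esr >> 25) & 0x1);
	s.ids = (unsigned int)((esr >> 24) & 0x1);
	s.impl = (uint32_t)(esr & 0xffffff);
	return s;
}
```

For 0xbf000002 this gives EC 0x2f (SError), IL set, IDS set and low
bits 0x2, so everything below bit 24 is up to the core implementation.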

> I was wondering why we're not checking in the exception handler whether
> the accessed address is in ECAM space, and just return from the handler
> since such exceptions could be handled by returning "all ones" in
> software from the PCI core.

We are not checking that because it is an asynchronous exception. You
do not know who caused it, when, or why. The exception may also arrive
after a lot of other instructions have been executed. It is a
non-recoverable fatal exception and it means something like an internal
core error. The only reasonable thing to do is to reset the CPU.

The CPU should not receive an AXI Slave Error under a non-fatal
condition, and it is basically a bug in that PCIe controller that it
sends such a thing to the CPU core. For ECAM load/store operations the
PCIe controller should always return an AXI Slave OK response and, on
error, set the data part to all ones.
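
As an illustration of the all-ones convention: a config read that fails
should look to software like a read of 0xffffffff, which can then be
recognized without any exception. The sketch below is an example under
stated assumptions, not kernel code: ecam_offset() uses the standard
ECAM layout (bus in bits [27:20], device in [19:15], function in
[14:12], register offset in [11:0]), and both function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte offset of a config register inside an ECAM window. */
uint32_t ecam_offset(unsigned int bus, unsigned int dev,
		     unsigned int fn, unsigned int reg)
{
	return ((bus & 0xffu) << 20) | ((dev & 0x1fu) << 15) |
	       ((fn & 0x7u) << 12) | (reg & 0xfffu);
}

/*
 * Hypothetical helper: an all-ones dword is what a failed config read
 * is expected to return. It is only a hint, because a register may
 * legitimately contain all ones.
 */
bool config_read_failed(uint32_t val)
{
	return val == UINT32_MAX;
}
```

With a controller that behaves this way, a read of the Vendor ID dword
of a missing device yields all ones and enumeration simply skips it.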

The proper way is to find out whether the PCIe controller has some
hidden or debug register which allows disabling these errors.
IIRC DesignWare has one, so there is a good chance that Broadcom has
one too. But, for example, Cadence does not have one (yet).

> Then again, perhaps there's a method to stop the controller from
> raising an exception on ECAM access to an inaccessible device.
> If such a method exists (e.g. some register bit), that would
> obviously be preferred.

Another way is to map the ECAM address space in strongly ordered mode.
That would make the CPU core wait, while executing a load or store
operation, until it has completely finished, and the AXI Slave Error is
then reported as a synchronous exception (Data Abort), which can be
handled. On ARMv7 this was possible by marking the mapping as Device.
On ARMv8 it should be possible too, but on A53 it is unimplemented.
Maybe it is possible on A72? But it is kind of a hack to work around a
total mess.

I will send a separate email to the Broadcom people who already helped
with one of their PCIe controllers in the past. Maybe there is a hidden
debug register bit which can turn it off.

> 
> > > The switch works fine 100% when plugged into a TI Sitara AM64 board
> > > (contains a DesignWare-derived PCIe host controller).
> > 
> > Is it really DesignWare? I had the impression that TI uses PCIe IPs
> > from Cadence, not from DesignWare. And Cadence controllers behave in
> > some cases differently from DesignWare controllers.
> 
> You're right, I was mistaken, it's indeed a Cadence.
> 
> 
> > > Next step is to hook up
> > > a Teledyne T28 analyzer to see what's going on.
> > 
> > Can you use the Teledyne T28 for debugging this issue? Because this
> > is something which can finally show what is happening there.
> 
> Yes it should be possible to debug this, the analyzer is capable of
> logging the link training sequence and present it in a Wireshark-esque
> interface.
> 
> Thanks,
> 
> Lukas

The main issue is that nobody had an analyzer for doing this; it is not
a cheap device.
