All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Kelvin.Cao@microchip.com
Cc: kurt.schwemmer@microsemi.com, bhelgaas@google.com,
	kelvincao@outlook.com, linux-pci@vger.kernel.org,
	logang@deltatee.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/5] PCI/switchtec: Error out MRPC execution when no GAS access
Date: Wed, 6 Oct 2021 15:20:16 -0500	[thread overview]
Message-ID: <20211006202016.GA1178025@bhelgaas> (raw)
In-Reply-To: <70abccd6144d8eac461866355e3a6963b3fb3fe7.camel@microchip.com>

On Wed, Oct 06, 2021 at 07:00:55PM +0000, Kelvin.Cao@microchip.com wrote:
> On Wed, 2021-10-06 at 09:19 -0500, Bjorn Helgaas wrote:
> > On Wed, Oct 06, 2021 at 05:49:29AM +0000, Kelvin.Cao@microchip.com
> > wrote:
> > > On Tue, 2021-10-05 at 21:33 -0500, Bjorn Helgaas wrote:
> > > > On Wed, Oct 06, 2021 at 12:37:02AM +0000, 
> > > > Kelvin.Cao@microchip.com
> > > > wrote:
> > > > > On Tue, 2021-10-05 at 15:11 -0500, Bjorn Helgaas wrote:
> > > > > > On Mon, Oct 04, 2021 at 08:51:06PM +0000,
> > > > > > Kelvin.Cao@microchip.com
> > > > > > wrote:
> > > > > > > On Sat, 2021-10-02 at 10:11 -0500, Bjorn Helgaas wrote:
> > > > > > > > I *thought* the problem was that the PCIe Memory Read
> > > > > > > > failed and the Root Complex fabricated ~0 data to
> > > > > > > > complete
> > > > > > > > the CPU read.  But now I'm not sure, because it sounds
> > > > > > > > like it might be that the PCIe transaction succeeds, but
> > > > > > > > it reads data that hasn't been updated by the firmware,
> > > > > > > > i.e., it reads 'in progress' because firmware hasn't
> > > > > > > > updated it to 'done'.
> > > > > > > 
> > > > > > > The original message was sort of misleading. After a
> > > > > > > firmware reset, CPU getting ~0 for the PCIe Memory Read
> > > > > > > doesn't explain the hang.  In a MRPC execution (DMA MRPC
> > > > > > > mode), the MRPC status which is located in the host memory,
> > > > > > > gets initialized by the CPU and updated/finalized by the
> > > > > > > firmware. In the situation of a firmware reset, any MRPC
> > > > > > > initiated afterwards will not get the status updated by the
> > > > > > > firmware per the reason you pointed out above (or similar,
> > > > > > > to my understanding, firmware can no longer DMA data to
> > > > > > > host
> > > > > > > memory in such cases), therefore the MRPC execution will
> > > > > > > never end.
> > > > > > 
> > > > > > I'm glad this makes sense to you, because it still doesn't to
> > > > > > me.
> > > > > > 
> > > > > > check_access() does an MMIO read to something in BAR0.  If
> > > > > > that read returns ~0, it means either the PCIe Memory Read
> > > > > > was
> > > > > > successful and the Switchtec device supplied ~0 data (maybe
> > > > > > because firmware has not initialized that part of the BAR) or
> > > > > > the PCIe Memory Read failed and the root complex fabricated
> > > > > > the ~0 data.
> > > > > > 
> > > > > > I'd like to know which one is happening so we can clarify the
> > > > > > commit log text about "MRPC command executions hang
> > > > > > indefinitely" and "host wil fail all GAS reads."  It's not
> > > > > > clear whether these are PCIe protocol issues or
> > > > > > driver/firmware interaction issues.
> > > > > 
> > > > > I think it's the latter case, the ~0 data was fabricated by the
> > > > > root complex, as the MMIO read in check_access() always returns
> > > > > ~0 until a reboot or a rescan happens.
> > > > 
> > > > If the root complex fabricates ~0, that means a PCIe transaction
> > > > failed, i.e., the device didn't respond.  Rescan only does config
> > > > reads and writes.  Why should that cause the PCIe transactions to
> > > > magically start working?
> > > 
> > > I took a closer look. What I observed was like this. A firmware
> > > reset cleared some CSR settings including the MSE and MBE bits and
> > > the Base Address Registers. With a rescan (removing the switch to
> > > which the management EP was binded from root port and rescan), the
> > > management EP was re-enumerated and driver was re-probed, so that
> > > the settings cleared by the firmware reset was properly setup
> > > again,
> > > therefore PCIe transactions start working.
> > 
> > I think what you just said is that
> > 
> >   - the driver asked the firmware to reset the device
> > 
> >   - the firmware did reset the device, which cleared Memory Space
> >     Enable
> > 
> >   - nothing restored the device config after the reset, so Memory
> >     Space Enable remains cleared
> > 
> >   - the driver does MMIO reads to figure out when the reset has
> >     completed
> > 
> >   - the device doesn't respond to the PCIe Memory Reads because
> > Memory
> >     Space Enable is cleared
> > 
> >   - the root complex sees a timeout or error completion and
> > fabricates
> >     ~0 data for the CPU read
> > 
> >   - the driver sees ~0 data from the MMIO read and thinks the device
> >     or firmware is hung
> > 
> > If that's all true, I think the patch is sort of a band-aid that
> > doesn't fix the problem at all but only makes the driver's response
> > to
> > it marginally better.  But the device is still unusable until a
> > rescan
> > or reboot.
> > 
> > So I think we should drop this patch and do something to restore the
> > device state after the reset.
> 
> Do you mean we should do something at the driver level to automatically
> try to restore the device state after the reset? I was thinking it's up
> to the user to make the call to restore the device state or take other
> actions, so that returning an error code from MRPC to indicate what
> happened would be good enough for the driver. 

It sounds like this is a completely predictable situation.  Why would
you want manual user intervention?

I'm pretty sure there are drivers that save state *before* the reset
and restore it afterwards.

> Can you possibly shed light on what might be a reasonable way to
> restore the device state in the driver if applicable? I was just doing
> it by leveraging the remove and rescan interfaces in the sysfs.
> 
> That's all true. I lean towards keeping the patch as I think making the
> response better under the following situations might not be bad.
> 
>   - The firmware reset case, as we discussed. I'd think it's still
> useful for users to get a fast error return which indicates something
> bad happened and some actions need to be taken either to abort or try
> to recover. In this case, we are assuming that a firmware reset will
> boot the firmware successfully.

So wait, you mean you just intentionally ask the firmware to reset,
knowing that the device will be unusable until the user reboots or
does a manual rescan?  And the way to improve this is for the driver
to report an error to the user instead of hanging?  I *guess* that
might be some sort of improvement, but seems like a pretty small one.

>   - The firwmare crashes and doesn't respond, which normally is the
> reason for users to issue a firmware reset command to try to recover it
> via either the driver or a sideband interface. The firmware may not be
> able to recover by a reset in some extream situations like hardware
> errors, so that an error return is probably all the users can get
> before another level of recovery happens.
> 
> So I'd think this patch is still making the driver better in some way.
> 
> Kelvin
> 

  reply	other threads:[~2021-10-06 20:20 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-24 11:08 [PATCH 0/5] Switchtec Fixes and Improvements kelvin.cao
2021-09-24 11:08 ` [PATCH 1/5] PCI/switchtec: Error out MRPC execution when no GAS access kelvin.cao
2021-10-01 20:18   ` Bjorn Helgaas
2021-10-01 20:29     ` Logan Gunthorpe
2021-10-01 23:49       ` Kelvin.Cao
2021-10-02 15:11         ` Bjorn Helgaas
2021-10-04 20:51           ` Kelvin.Cao
2021-10-05 20:11             ` Bjorn Helgaas
2021-10-06  0:37               ` Kelvin.Cao
2021-10-06  2:33                 ` Bjorn Helgaas
2021-10-06  5:49                   ` Kelvin.Cao
2021-10-06 14:19                     ` Bjorn Helgaas
2021-10-06 19:00                       ` Kelvin.Cao
2021-10-06 20:20                         ` Bjorn Helgaas [this message]
2021-10-06 21:27                           ` Kelvin.Cao
2021-10-07 21:23                             ` Bjorn Helgaas
2021-10-08  0:06                               ` Kelvin.Cao
2021-10-08 11:03                                 ` Bjorn Helgaas
2021-10-01 22:58     ` Kelvin.Cao
2021-10-01 23:52       ` Logan Gunthorpe
2021-10-02  0:05         ` Kelvin.Cao
2021-09-24 11:08 ` [PATCH 2/5] PCI/switchtec: Fix a MRPC error status handling issue kelvin.cao
2021-09-24 11:08 ` [PATCH 3/5] PCI/switchtec: Update the way of getting management VEP instance ID kelvin.cao
2021-09-24 11:08 ` [PATCH 4/5] PCI/switchtec: Replace ENOTSUPP with EOPNOTSUPP kelvin.cao
2021-09-24 11:08 ` [PATCH 5/5] PCI/switchtec: Add check of event support kelvin.cao
2021-09-24 15:53 ` [PATCH 0/5] Switchtec Fixes and Improvements Logan Gunthorpe
2021-09-25  5:27   ` Kelvin.Cao
2021-09-27 16:39 ` Bjorn Helgaas
2021-09-27 18:25   ` Kelvin.Cao
2021-10-08 17:05 ` Bjorn Helgaas
2021-10-08 17:23   ` Logan Gunthorpe
2021-10-08 18:25     ` Bjorn Helgaas
  -- strict thread matches above, loose matches on Subject: below --
2021-09-24 10:24 [PATCH 0/5] Switchtec Miscellaneous " kelvin.cao
2021-09-24 10:24 ` [PATCH 1/5] PCI/switchtec: Error out MRPC execution when no GAS access kelvin.cao
2021-09-24 10:22 [PATCH 0/5] Switchtec Miscellaneous Fixes and Improvements kelvin.cao
2021-09-24 10:22 ` [PATCH 1/5] PCI/switchtec: Error out MRPC execution when no GAS access kelvin.cao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211006202016.GA1178025@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Kelvin.Cao@microchip.com \
    --cc=bhelgaas@google.com \
    --cc=kelvincao@outlook.com \
    --cc=kurt.schwemmer@microsemi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=logang@deltatee.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.