From: Bjorn Helgaas <helgaas@kernel.org>
To: Tushar Dave <tdave@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>, Sagi Grimberg <sagi@grimberg.me>,
linux-nvme@lists.infradead.org, kbusch@kernel.org,
linux-pci@vger.kernel.org
Subject: Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
Date: Tue, 14 Mar 2023 11:11:27 -0500 [thread overview]
Message-ID: <20230314161127.GA1648664@bhelgaas> (raw)
In-Reply-To: <c17f7476-8ed0-212e-9480-78732635ee3f@nvidia.com>
On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
> On 3/11/23 00:22, Lukas Wunner wrote:
> > On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> > > On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > > > In the log below, pciehp obviously is enabled; should I infer that in
> > > > the log above, it is not?
> > >
> > > pciehp is enabled all the time, in the log above and in the one below.
> > > I do not yet have an answer as to why pciehp shows up only in some
> > > tests (due to DPC link down/up) and not in others, as you noticed in
> > > both logs.
> >
> > Maybe some of the switch Downstream Ports are hotplug-capable and
> > some are not? (Check the Slot Implemented bit in the PCI Express
> > Capabilities Register as well as the Hot-Plug Capable bit in the
> > Slot Capabilities Register.)
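(As a quick illustration, not from Lukas's mail: a minimal sketch of
checking both bits with the standard kernel config-space accessors and
the register macros from include/uapi/linux/pci_regs.h; the function
name here is made up.)

	static bool port_has_hotplug_slot(struct pci_dev *pdev)
	{
		u16 flags;
		u32 sltcap;

		/* Slot Implemented bit in the PCI Express Capabilities Register */
		pcie_capability_read_word(pdev, PCI_EXP_FLAGS, &flags);
		if (!(flags & PCI_EXP_FLAGS_SLOT))
			return false;

		/* Hot-Plug Capable bit in the Slot Capabilities Register */
		pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &sltcap);
		return sltcap & PCI_EXP_SLTCAP_HPC;
	}

(From userspace, "lspci -vv" shows the same information as "Slot+" on
the Express capability line and "HotPlug+" in SltCap.)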
> > ...
> > > > Generally we've avoided handling a device reset as a
> > > > remove/add event because upper layers can't deal well with
> > > > that. But in the log below it looks like pciehp *did* treat
> > > > the DPC containment as a remove/add, which of course involves
> > > > configuring the "new" device and its MPS settings.
> > >
> > > Yes, and that puzzled me: why? Especially when "Link Down/Up
> > > ignored (recovered by DPC)" is logged. Do we still have a race
> > > somewhere? I am not sure.
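(Side note on the MPS point above: a minimal sketch of the kind of
fix-up being discussed, i.e. keeping a device's Max Payload Size
consistent with its upstream port after the device reappears.
pcie_get_mps()/pcie_set_mps() and pci_upstream_bridge() are the real
kernel accessors; the matching policy shown is illustrative only, the
actual enumeration code is more involved.)

	static void match_mps_to_upstream(struct pci_dev *dev)
	{
		struct pci_dev *up = pci_upstream_bridge(dev);
		int mps;

		if (!up)
			return;

		/* Bring the device's MPS in line with its upstream port. */
		mps = pcie_get_mps(up);
		if (pcie_get_mps(dev) != mps)
			pcie_set_mps(dev, mps);
	}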
> >
> > You're seeing the expected behavior. pciehp ignores DLLSC events
> > caused by DPC, but then double-checks that DPC recovery succeeded.
> > If it didn't, it would be a bug not to bring down the slot. So
> > pciehp does exactly that. See this code snippet in
> > pciehp_ignore_dpc_link_change():
> >
> > /*
> > * If the link is unexpectedly down after successful recovery,
> > * the corresponding link change may have been ignored above.
> > * Synthesize it to ensure that it is acted on.
> > */
> > down_read_nested(&ctrl->reset_lock, ctrl->depth);
> > if (!pciehp_check_link_active(ctrl))
> > pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> > up_read(&ctrl->reset_lock);
> >
> > So on hotplug-capable ports, pciehp is able to mop up the mess
> > created by fiddling with the MPS settings behind the kernel's
> > back.
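(For readers unfamiliar with the pciehp internals: the pciehp_request()
call in the snippet above roughly amounts to the following, paraphrased
from memory rather than quoted verbatim. It records the synthesized
event and wakes the slot's IRQ thread, so the event is handled exactly
as if the hardware had latched DLLSC itself.)

	static void request_event(struct controller *ctrl, int events)
	{
		/* Remember the event for the IRQ thread to pick up ... */
		atomic_or(events, &ctrl->pending_events);
		/* ... and wake that thread unless we are in poll mode. */
		if (!pciehp_poll_mode)
			irq_wake_thread(ctrl->pcie->irq, ctrl);
	}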
>
> That's the thing: even on a hotplug-capable slot I do not see pciehp
> _all_ the time. Sometimes pciehp gets involved and takes care of things
> (like I mentioned in the previous thread), and other times there is no
> pciehp engagement at all!
Possibly a timing issue, so I'll be interested to see if 53b54ad074de
("PCI/DPC: Await readiness of secondary bus after reset") makes any
difference. Lukas didn't mention that, so maybe it's a red herring,
but I'm still curious since it explicitly mentions the DPC reset case
that you're exercising here.
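(For reference, the idea behind that commit, sketched with an
illustrative helper rather than the actual patch: after a DPC reset,
poll the first device on the secondary bus until it answers config
reads instead of touching it immediately. The helper name and timeout
value here are arbitrary for the sketch.)

	static int wait_for_secondary_bus(struct pci_dev *bridge)
	{
		struct pci_dev *child;
		int timeout = 1000;	/* ms, arbitrary for this sketch */
		u32 id;

		child = list_first_entry_or_null(&bridge->subordinate->devices,
						 struct pci_dev, bus_list);
		if (!child)
			return 0;	/* nothing below the port */

		for (; timeout > 0; timeout -= 10) {
			/* An all-ones read means the device isn't ready yet. */
			pci_read_config_dword(child, PCI_VENDOR_ID, &id);
			if (id != (u32)~0)
				return 0;
			msleep(10);
		}
		return -ETIMEDOUT;
	}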
Bjorn