linux-pci.vger.kernel.org archive mirror
From: Keith Busch <keith.busch@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: linux-pci@vger.kernel.org, Bjorn Helgaas <bhelgaas@google.com>,
	Lukas Wunner <lukas@wunner.de>, Wei Zhang <wzhang@fb.com>
Subject: Re: [PATCHv4 next 0/3] Limiting pci access
Date: Tue, 13 Dec 2016 18:18:40 -0500	[thread overview]
Message-ID: <20161213231840.GD12113@localhost.localdomain> (raw)
In-Reply-To: <20161213205012.GA29950@bhelgaas-glaptop.roam.corp.google.com>

On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > Depending on the device and the driver, there are hundreds to thousands
> > > > of non-posted transactions submitted to the device to complete driver
> > > > unbinding and removal. Since the device is gone, hardware has to handle
> > > > that as an error condition, which is slower than a successful
> > > > non-posted transaction. Since we're doing 1000 of them for no particular
> > > > reason, it takes a long time. If you hot remove a switch with multiple
> > > > downstream devices, the serialized removal adds up to many seconds.
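
(For scale, a rough back-of-the-envelope using the ~10ms completion
timeout that comes up later in this thread: if each of ~1000 failed
non-posted accesses has to wait out a timeout on that order, and they
are all serialized, that's on the order of 10 seconds per device, which
lines up with the multi-second removal times described above.)
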
> > > 
> > > Another thread mentioned 1-2us as a reasonable config access cost, and
> > > I'm still a little puzzled about how we get to something on the order
> > > of a million times that cost.
> > > 
> > > I know this is all pretty hand-wavey, but 1000 config accesses to shut
> > > down a device seems unreasonably high.  The entire config space is
> > > only 4096 bytes, and most devices use a small fraction of that.  If
> > > we're really doing 1000 accesses, it sounds like we're doing something
> > > wrong, like polling without a delay or something.
> > 
> > Every time pci_find_ext_capability is called on a removed device, the
> > kernel will do 481 failed config space accesses trying to find that
> > capability. The kernel used to do that multiple times to find the AER
> > capability under conditions common to surprise removal.
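
To make the 481 figure concrete, here's a rough sketch of that walk,
modeled on the upstream pci_find_ext_capability() loop (not a verbatim
copy; details vary by kernel version):

  #include <linux/pci.h>

  /*
   * On a removed device every config read comes back as all ones, so
   * PCI_EXT_CAP_ID() never matches, PCI_EXT_CAP_NEXT() decodes to 0xffc
   * instead of 0, and the walk only stops when the ttl budget of
   * (4096 - 256) / 8 = 480 entries runs out: 1 + 480 = 481 failed
   * non-posted reads.
   */
  static int find_ext_cap_sketch(struct pci_dev *dev, int cap)
  {
          int ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8;
          int pos = PCI_CFG_SPACE_SIZE;           /* 0x100 */
          u32 header;

          if (pci_read_config_dword(dev, pos, &header))
                  return 0;
          if (header == 0)                        /* all ones is *not* caught */
                  return 0;

          while (ttl-- > 0) {
                  if (PCI_EXT_CAP_ID(header) == cap)
                          return pos;
                  pos = PCI_EXT_CAP_NEXT(header); /* 0xffc when header is ~0 */
                  if (pos < PCI_CFG_SPACE_SIZE)
                          break;
                  if (pci_read_config_dword(dev, pos, &header))
                          break;
          }
          return 0;
  }
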
> 
> Right, that's a perfect example.  I'd rather fix issues like this by
> caching the position as you did with AER.  The "removed" bit makes
> these issues "go away" without addressing the underlying problem.
> 
> We might still need a "removed" bit for other reasons, but I want to
> be clear about those reasons, not just throw it in under the general
> "make it go fast" umbrella.
> 
> > But now that we cache the AER position (commit: 66b80809), we've
> > eliminated by far the worst offender. The counts I quoted above still
> > come from the original captured traces showing long tear-down times,
> > so they're not up to date with the most recent version of the kernel.
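
The caching fix amounts to doing that walk once at enumeration time,
while the device is still reachable, and reusing the saved offset from
then on. Roughly (the aer_cap field and helper name below follow my
recollection of what 66b80809 added, so treat them as illustrative):

  #include <linux/pci.h>

  /* Look the AER capability up once and stash the offset in the
   * pci_dev, so error paths after a surprise removal never have to
   * re-walk extended config space on a dead link. */
  static void cache_aer_cap_sketch(struct pci_dev *dev)
  {
          dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
  }
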
> >  
> > > I measured the cost of config reads during enumeration using the TSC
> > > on a 2.8GHz CPU and found the following:
> > > 
> > >   1580 cycles, 0.565 usec (device present)
> > >   1230 cycles, 0.440 usec (empty slot)
> > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > 
> > > So 1-2usec does seem the right order of magnitude, and my "empty slot"
> > > error responses are actually *faster* than the "device present" ones,
> > > which is plausible to me because the Downstream Port can generate the
> > > error response immediately without sending a packet down the link.
> > > The "unimplemented function" responses take longer than the "empty
> > > slot", which makes sense because the Downstream Port does have to send
> > > a packet to the device, which then complains because it doesn't
> > > implement that function.
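
For what it's worth, a measurement like that can be reproduced with
something along these lines (illustrative only; assumes x86 and a fixed
2.8GHz TSC for the cycles-to-time conversion):

  #include <linux/pci.h>
  #include <asm/msr.h>

  /* Bracket a single config dword read with the TSC and convert the
   * cycle count to time at 2.8 cycles per nanosecond. */
  static void time_config_read_sketch(struct pci_dev *dev)
  {
          u64 t0, t1;
          u32 val;

          t0 = rdtsc();
          pci_read_config_dword(dev, PCI_VENDOR_ID, &val);
          t1 = rdtsc();

          pr_info("config read: %llu cycles (~%llu ns at 2.8GHz)\n",
                  t1 - t0, (t1 - t0) * 10 / 28);
  }
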
> > > 
> > > Of course, these aren't the same case as yours, where the link used to
> > > be up but is no longer.  Is there some hardware timeout to see if the
> > > link will come back?
> > 
> > Yes, the hardware does not respond immediately under this test, which
> > is considered an error condition. This is a reason why PCIe Device
> > Capabilities 2 Completion Timeout Ranges are recommended to be in the
> > 10ms range.
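
Those advertised ranges can be read straight out of Device Capabilities
2 to confirm what a given port supports; a minimal sketch (the 0xf mask
is the Completion Timeout Ranges Supported field from the spec; I'm not
sure pci_regs.h has a named macro for it):

  #include <linux/pci.h>

  /* Dump the Completion Timeout Ranges Supported field (DEVCAP2[3:0]).
   * Each set bit advertises a supported range, from 50us-10ms (A) up
   * to 4s-64s (D); 0 means only the default 50us-50ms range. */
  static void show_cto_ranges_sketch(struct pci_dev *dev)
  {
          u32 cap2;

          pcie_capability_read_dword(dev, PCI_EXP_DEVCAP2, &cap2);
          dev_info(&dev->dev, "completion timeout ranges supported: %#x\n",
                   cap2 & 0xf);
  }
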
> 
> And we're apparently still doing a lot of these accesses?  I'm still
> curious about exactly what these are, because it may be that we're
> doing more than necessary.

It's the MSI-X masking that's our next highest contributor. Masking
vectors still requires non-posted transactions, and since those don't go
through managed accessors the way config space accesses do, the removed
flag is needed to check the device before doing significant MMIO.
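
As a concrete illustration of what the flag buys us in that path, a
minimal sketch of the guard (the helper and function names below are
illustrative, not the actual patch):

  #include <linux/pci.h>
  #include <linux/io.h>

  /* Skip MSI-X vector masking when the device is already gone: the
   * MMIO write would be dropped and the readback would have to wait
   * out a completion timeout for every vector. */
  static void msix_mask_vector_sketch(struct pci_dev *dev,
                                      void __iomem *vector_ctrl)
  {
          if (pci_dev_is_disconnected(dev))
                  return;

          writel(1, vector_ctrl);         /* bit 0 = vector mask bit */
          readl(vector_ctrl);             /* non-posted readback flushes it */
  }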


Thread overview: 42+ messages
2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
2016-10-31 10:41   ` Lukas Wunner
2016-12-13 20:56   ` Bjorn Helgaas
2016-12-13 23:07     ` Keith Busch
2016-12-14  2:50       ` Bjorn Helgaas
2016-12-14  2:54         ` Bjorn Helgaas
2016-12-13 23:54     ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
2016-10-31 12:18   ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
2016-10-31 11:00   ` Lukas Wunner
2016-10-31 13:54     ` Keith Busch
2016-12-13 21:18   ` Bjorn Helgaas
2016-12-13 23:01     ` Keith Busch
2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-11-23 16:09   ` Bjorn Helgaas
2016-11-28  9:14     ` Wei Zhang
2016-11-28 10:22       ` Lukas Wunner
2016-11-28 18:02     ` Keith Busch
2016-12-08 17:54       ` Bjorn Helgaas
2016-12-08 19:32         ` Keith Busch
2016-12-12 23:42           ` Bjorn Helgaas
2016-12-13  0:55             ` Keith Busch
2016-12-13 20:50               ` Bjorn Helgaas
2016-12-13 23:18                 ` Keith Busch [this message]
     [not found]                   ` <B58D82457FDA0744A320A2FC5AC253B93D82F37D@fmsmsx104.amr.corp.intel.com>
     [not found]                     ` <20170120213550.GA16618@localhost.localdomain>
2017-01-21  7:31                       ` Lukas Wunner
2017-01-21  8:42                         ` Greg Kroah-Hartman
2017-01-21 14:22                           ` Lukas Wunner
2017-01-25 11:47                             ` Greg Kroah-Hartman
2017-01-23 16:04                           ` Keith Busch
2017-01-25  0:44                             ` Austin.Bolen
2017-01-25 21:17                               ` Bjorn Helgaas
2017-01-26  1:12                                 ` Austin.Bolen
2017-02-01 16:04                                   ` Bjorn Helgaas
2017-02-03 20:30                                     ` Austin.Bolen
2017-02-03 20:39                                       ` Greg KH
2017-02-03 21:43                                     ` Austin.Bolen
2017-01-25 11:48                             ` Greg Kroah-Hartman
2017-01-28  7:36                             ` Christoph Hellwig
2018-11-13  6:05                   ` Bjorn Helgaas
2018-11-13 14:59                     ` Keith Busch
