All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Keith Busch <keith.busch@intel.com>
Cc: linux-pci@vger.kernel.org, Lukas Wunner <lukas@wunner.de>,
	Wei Zhang <wzhang@fb.com>
Subject: Re: [PATCHv4 next 0/3] Limiting pci access
Date: Tue, 13 Nov 2018 00:05:11 -0600	[thread overview]
Message-ID: <20181113060510.GB182139@google.com> (raw)
In-Reply-To: <20161213231840.GD12113@localhost.localdomain>

On Tue, Dec 13, 2016 at 06:18:40PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > > Depending on the device and the driver, there are hundreds
> > > > > to thousands of non-posted transactions submitted to the
> > > > > device to complete driver unbinding and removal. Since the
> > > > > device is gone, hardware has to handle that as an error
> > > > > condition, which is slower than the a successful non-posted
> > > > > transaction. Since we're doing 1000 of them for no
> > > > > particular reason, it takes a long time. If you hot remove a
> > > > > switch with multiple downstream devices, the serialized
> > > > > removal adds up to many seconds.
> > > > 
> > > > Another thread mentioned 1-2us as a reasonable config access
> > > > cost, and I'm still a little puzzled about how we get to
> > > > something on the order of a million times that cost.
> > > > 
> > > > I know this is all pretty hand-wavey, but 1000 config accesses
> > > > to shut down a device seems unreasonably high.  The entire
> > > > config space is only 4096 bytes, and most devices use a small
> > > > fraction of that.  If we're really doing 1000 accesses, it
> > > > sounds like we're doing something wrong, like polling without
> > > > a delay or something.
> > > 
> > > Every time pci_find_ext_capability is called on a removed
> > > device, the kernel will do 481 failed config space accesses
> > > trying to find that capability. The kernel used to do that
> > > multiple times to find the AER capability under conditions
> > > common to surprise removal.
> > 
> > Right, that's a perfect example.  I'd rather fix issues like this
> > by caching the position as you did with AER.  The "removed" bit
> > makes these issues "go away" without addressing the underlying
> > problem.
> > 
> > We might still need a "removed" bit for other reasons, but I want
> > to be clear about those reasons, not just throw it in under the
> > general "make it go fast" umbrella.
> > 
> > > But now that we cache the AER position (commit: 66b80809), we've
> > > eliminated by far the worst offender. The counts I'm telling you
> > > are still referencing the original captured traces showing long
> > > tear down times, so it's not up-to-date with the most recent
> > > version of the kernel.
> > >  
> > > > I measured the cost of config reads during enumeration using
> > > > the TSC on a 2.8GHz CPU and found the following:
> > > > 
> > > >   1580 cycles, 0.565 usec (device present)
> > > >   1230 cycles, 0.440 usec (empty slot)
> > > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > > 
> > > > So 1-2usec does seem the right order of magnitude, and my
> > > > "empty slot" error responses are actually *faster* than the
> > > > "device present" ones, which is plausible to me because the
> > > > Downstream Port can generate the error response immediately
> > > > without sending a packet down the link.  The "unimplemented
> > > > function" responses take longer than the "empty slot", which
> > > > makes sense because the Downstream Port does have to send a
> > > > packet to the device, which then complains because it doesn't
> > > > implement that function.
> > > > 
> > > > Of course, these aren't the same case as yours, where the link
> > > > used to be up but is no longer.  Is there some hardware
> > > > timeout to see if the link will come back?
> > > 
> > > Yes, the hardware does not respond immediately under this test,
> > > which is considered an error condition. This is a reason why
> > > PCIe Device Capabilities 2 Completion Timeout Ranges are
> > > recommended to be in the 10ms range.
> > 
> > And we're apparently still doing a lot of these accesses?  I'm
> > still curious about exactly what these are, because it may be that
> > we're doing more than necessary.
> 
> It's the MSI-x masking that's our next highest contributor. Masking
> vectors still requires non-posted commands, and since they're not
> going through managed API accessors like config space uses, the
> removed flag is needed for checking before doing significant MMIO.

Sorry for digging up this ancient history, but do you remember where
this MSI-X masking with non-posted commands happens?  The masking is
an MMIO write (posted) and there should only be a non-posted MMIO read
if we use the msi_set_mask_bit() path.

I'm looking at this path:

  nvme_pci_disable
    pci_free_irq_vectors
      pci_disable_msix
        pci_msix_shutdown
          if (pci_dev_is_disconnected())    # added by 0170591bb067
            return                          # (skip hw access)
          for_each_pci_msi_entry(...)       # <-- loop
            __pci_msix_desc_mask_irq
              writel                        # <-- MMIO write
          pci_read_config_word(PCI_MSIX_FLAGS)
          pci_write_config_word(PCI_MSIX_FLAGS)
        free_msi_irqs

whih only does MMIO *writes*, which I think are posted and do not
require completion and should not cause timeouts.  That's wasted
effort, I agree, but it doesn't seem like it should be a performance
issue.

So there must be another path, probably preceding this one (since
pci_disable_msix() cleans everything up), that masks the vectors and
does the non-posted reads?

The only places we do MMIO reads are msix_program_entries(), which is
done at MSI-X enable time, and msi_set_mask_bit(), which is used in
irq_chip.irq_mask() and irq_chip.irq_unmask() methods.  But I haven't
figured out where that irq_chip path is used in a loop.

Bjorn

  parent reply	other threads:[~2018-11-13  6:05 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-28 22:58 [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-10-28 22:58 ` [PATCHv4 next 1/3] pci: Add is_removed state Keith Busch
2016-10-31 10:41   ` Lukas Wunner
2016-12-13 20:56   ` Bjorn Helgaas
2016-12-13 23:07     ` Keith Busch
2016-12-14  2:50       ` Bjorn Helgaas
2016-12-14  2:54         ` Bjorn Helgaas
2016-12-13 23:54     ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 2/3] pci: No config access for removed devices Keith Busch
2016-10-31 12:18   ` Lukas Wunner
2016-10-28 22:58 ` [PATCHv4 next 3/3] pci/msix: Skip disabling " Keith Busch
2016-10-31 11:00   ` Lukas Wunner
2016-10-31 13:54     ` Keith Busch
2016-12-13 21:18   ` Bjorn Helgaas
2016-12-13 23:01     ` Keith Busch
2016-11-18 23:25 ` [PATCHv4 next 0/3] Limiting pci access Keith Busch
2016-11-23 16:09   ` Bjorn Helgaas
2016-11-28  9:14     ` Wei Zhang
2016-11-28 10:22       ` Lukas Wunner
2016-11-28 18:02     ` Keith Busch
2016-12-08 17:54       ` Bjorn Helgaas
2016-12-08 19:32         ` Keith Busch
2016-12-12 23:42           ` Bjorn Helgaas
2016-12-13  0:55             ` Keith Busch
2016-12-13 20:50               ` Bjorn Helgaas
2016-12-13 23:18                 ` Keith Busch
     [not found]                   ` <B58D82457FDA0744A320A2FC5AC253B93D82F37D@fmsmsx104.amr.corp.intel.com>
     [not found]                     ` <20170120213550.GA16618@localhost.localdomain>
2017-01-21  7:31                       ` Lukas Wunner
2017-01-21  8:42                         ` Greg Kroah-Hartman
2017-01-21 14:22                           ` Lukas Wunner
2017-01-25 11:47                             ` Greg Kroah-Hartman
2017-01-23 16:04                           ` Keith Busch
2017-01-25  0:44                             ` Austin.Bolen
2017-01-25 21:17                               ` Bjorn Helgaas
2017-01-26  1:12                                 ` Austin.Bolen
2017-02-01 16:04                                   ` Bjorn Helgaas
2017-02-03 20:30                                     ` Austin.Bolen
2017-02-03 20:39                                       ` Greg KH
2017-02-03 21:43                                     ` Austin.Bolen
2017-01-25 11:48                             ` Greg Kroah-Hartman
2017-01-28  7:36                             ` Christoph Hellwig
2018-11-13  6:05                   ` Bjorn Helgaas [this message]
2018-11-13 14:59                     ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181113060510.GB182139@google.com \
    --to=helgaas@kernel.org \
    --cc=keith.busch@intel.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=wzhang@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.