From: Bjorn Helgaas <helgaas@kernel.org>
To: Ben Shelton <benjamin.h.shelton@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] PCI: IOV: read SRIOV_NUM_VF after enabling ARI
Date: Fri, 16 Oct 2015 13:07:02 -0500 [thread overview]
Message-ID: <20151016180701.GB21346@localhost> (raw)
In-Reply-To: <20151016165627.GA52728@bhshelto-vm>
Hi Ben,
Thanks for the detailed response!
On Fri, Oct 16, 2015 at 11:56:28AM -0500, Ben Shelton wrote:
> Hi Bjorn,
>
> > What problem does this patch solve, Ben? I assume you have devices
> > that do change TotalVFs when ARI is enabled, and you do want the new
> > value?
> >
> > Or is the problem something like the following:
> >
> > - ...
> > - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
> > - Linux sets ARI Capable Hierarchy
> > - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
> > - Driver reads TotalVFs and sees X + Y
> > - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
> > sriov_enable() sees "X + Y > iov->total_VFs"
>
> Here's a short snippet from the databook for the PCI Express controller we're
> using:
>
> "Supports two sets of VF Stride, First VF Offset, InitialVFs, and TotalVFs
> registers per PF—one each for ARI and non-ARI hierarchies. Selection is
> performed by host software through the ARI Capable Hierarchy bit of the Control
> register in the PF0 SR-IOV capability structure."
>
> The values in InitialVFs and TotalVFs are HWinit for each set of registers.
The HwInit description says "bits are read-only after initialization
and can only be reset (for write-once by firmware) with 'Power Good
Reset'." I don't see any provision for different values based on a
control register bit, so I think this device is actually out of spec.
We should be able to deal with it, so it's not that big a deal, but we
will have to keep it in mind and probably mention it in a comment in
the code.
> So the issue this is intended to fix is the following:
>
> - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs).
> - Linux sets ARI Capable Hierarchy.
> - Device switches to exposing the second set of registers, where
> InitialVFs = TotalVFs = Y (where Y > X).
> - User enables one or more VFs on the device, e.g. by writing a value to
> sriov_numvfs in the sysfs.
> - Driver calls pci_enable_sriov() for the device, which then calls
> sriov_enable(). sriov_enable() reads InitialVFs (= Y) and then checks if it's
> greater than iov->total_VFs (= X). Since Y > X, the comparison is true, so
> sriov_enable() fails out and returns -EIO.
I think there are two problems here:
1) We should be reading some registers together to make sure we get
consistent values. For example, we always read VFOffset and
VFStride immediately after writing NumVFs. I think we should
read InitialVFs and TotalVFs together. I don't see the point of
reading TotalVFs in sriov_init() and reading InitialVFs in
sriov_enable(). If we read them both in sriov_init(), I don't
think we'd have this problem of sriov_enable() returning -EIO.
2) To work around this non-compliant device, we should read InitialVFs
and TotalVFs after setting the ARI bit.
Ideally, I think this would be two patches: one to move the InitialVFs
read from sriov_enable() to sriov_init(), and a second to move the
pair from before setting ARI to after.
> > I'm a little dubious about drivers reading the SRIOV capability
> > directly, so maybe this is a symptom of deeper problems.
>
> I agree that the driver should not be reading the capability directly, but from
> what I understand, it's intended for the device itself to do this. From the PCI
> SR-IOV spec revision 1.1:
>
> "ARI Capable Hierarchy is a hint to the Device that ARI has been enabled in the
> Root Port or Switch Downstream Port immediately above the Device."
Sure, of course, the device should behave differently based on how the
registers are programmed; that's the whole point of having writable
registers. I think the particular case of the device changing HwInit
registers is out of spec, but changing things like VFOffset and
VFStride is completely expected.
Bjorn
next prev parent reply other threads:[~2015-10-16 18:07 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-08 15:20 [PATCH v2] PCI: IOV: read SRIOV_NUM_VF after enabling ARI Ben Shelton
2015-10-15 17:58 ` Bjorn Helgaas
2015-10-15 20:00 ` Alexander Duyck
2015-10-15 21:36 ` Bjorn Helgaas
2015-10-15 22:14 ` Alexander Duyck
2015-10-16 16:56 ` Ben Shelton
2015-10-16 18:07 ` Bjorn Helgaas [this message]
2015-10-15 19:31 ` Bjorn Helgaas
2015-10-21 20:52 ` Bjorn Helgaas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151016180701.GB21346@localhost \
--to=helgaas@kernel.org \
--cc=alexander.duyck@gmail.com \
--cc=benjamin.h.shelton@intel.com \
--cc=bhelgaas@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).