Re: [PATCH] PCI/IOV: update num_VFs earlier

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Bjorn Helgaas <helgaas@kernel.org>
To: CREGUT Pierre IMT/OLN <pierre.cregut@orange.com>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Donald Dutile <ddutile@redhat.com>,
	Alexander Duyck <alexander.h.duyck@intel.com>,
	Jakub Kicinski <jakub.kicinski@netronome.com>
Subject: Re: [PATCH] PCI/IOV: update num_VFs earlier
Date: Thu, 3 Oct 2019 17:10:07 -0500	[thread overview]
Message-ID: <20191003221007.GA209602@google.com> (raw)
In-Reply-To: <49b0ad6d-7b6f-adbd-c4a3-5f9328a7ad9d@orange.com>

[+cc Don, Alex, Jakub]

On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> Le 02/10/2019 à 01:45, Bjorn Helgaas a écrit :
> > On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> > > I also initially thought that kobject_uevent generated the netlink event
> > > but this is not the case. This is generated by the specific driver in use.
> > > For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> > > i40e_pci_sriov_configure that sends the event.
> > > It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> > > finally calls the generic pci_enable_sriov function.
> > I don't know anything about netlink.  The script from the bugzilla
> > (https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
> > runs
> > 
> >    ip monitor dev enp9s0f2
> > 
> > What are the actual netlink events you see?  Are they related to a
> > device being removed?
> 
> We have netlink events both when num_vfs goes from 0 to N and from N to 0.
> Indeed you have to go to 0 before going to M with M != N.

Right.

> On an Intel card, when one goes from 0 to N, the netlink event is
> sent "early". The value of num_vfs is still 0 and you get the
> impression that the number of VFS has not changed. As the meaning of
> those events is overloaded, you have to wait an arbitrary amount of
> time until it settles (there will be no other event).  There is no
> such problem when it goes from N to 0 because of implementation
> details but it may be different for another brand.

I hadn't looked far enough.  I think the "remove" netlink events are
probably from the i40e_do_reset_safe() path, which eventually calls
free_netdev() and put_device().

The pci_enable_sriov() path calls the driver's ->probe method, and I
suspect the "add" netlink events are emitted there.

> > When we change num_VFs, I think we have to disable any existing VFs
> > before enabling the new num_VFs, so if you trigger on a netlink
> > "remove" event, I wouldn't be surprised that reading sriov_numvfs
> > would give a zero until the new VFs are enabled.
> Yes but we are speaking of the event sent when num_vfs is changed from 0 to
> N
> > [...]
> > I thought this was a good idea, but
> > 
> >    - It does break the device_lock() encapsulation a little bit:
> >      sriov_numvfs_store() uses device_lock(), which happens to be
> >      implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
> >      rely on that implementation, and

> The use of device_lock was the cheapest solution. It is true that
> lock and trylock are exposed by device.h but not is_locked. To
> respect the abstraction, we would have to lock the device (at least
> use trylock but it means locking when we can access the value, in
> that case we may just make reading num_vfs blocking ?).
> 
> The other solution is to record the state of freshness of num_vfs
> but it means a new Boolean in the pci_sriov data-structure.

> 
> >    - The netlink events are being generated via the NIC driver, and I'm
> >      a little hesitant about changing the PCI core to deal with timing
> >      issues "over there".
> 
> NIC drivers send netlink events when their state change, but it is
> the core that changes the value of num_vfs. So I would think it is
> the core responsibility to make sure the exposed value makes sense
> and it would be better to ignore the details of the driver
> implementation.

Yes, I think you're right.  And I like your previous suggestion of
just locking the device in the reader.  I'm not enough of a sysfs
expert to know if there's a good reason to avoid a lock there.  Does
the following look reasonable to you?


commit 0940fc95da45
Author: Pierre Crégut <pierre.cregut@orange.com>
Date:   Wed Sep 11 09:27:36 2019 +0200

    PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
    
    When sriov_numvfs is being updated, drivers may notify about new devices
    before they are reflected in sriov->num_VFs, so concurrent sysfs reads
    previously returned stale values.
    
    Serialize the sysfs read vs the write so the read returns the correct
    num_VFs value.
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
    Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
    Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b3f972e8cfed..e77562aabbae 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
 				 char *buf)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
+	u16 num_vfs;
+
+	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
+	device_lock(&pdev->dev);
+	num_vfs = pdev->sriov->num_VFs;
+	device_lock(&pdev->dev);
 
-	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
+	return sprintf(buf, "%u\n", num_vfs);
 }
 
 /*

next prev parent reply	other threads:[~2019-10-03 22:10 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-29  8:00 [PATCH] PCI/IOV: update num_VFs earlier Pierre Crégut
2019-04-05 22:33 ` Bjorn Helgaas
2019-04-26  8:11   ` CREGUT Pierre IMT/OLN
2019-06-13 23:51     ` Bjorn Helgaas
2019-10-01 23:45     ` Bjorn Helgaas
2019-10-03  9:04       ` CREGUT Pierre IMT/OLN
2019-10-03 22:10         ` Bjorn Helgaas [this message]
2019-10-03 22:36           ` Jakub Kicinski
2019-10-03 22:37           ` Duyck, Alexander H
2019-10-08 21:38           ` Bjorn Helgaas
2019-10-08 22:06             ` Don Dutile
2019-10-09 12:31               ` Bjorn Helgaas
2019-10-09 14:20                 ` Don Dutile
  -- strict thread matches above, loose matches on Subject: below --
2019-03-25  8:18 Pierre Crégut

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:b3f972e8cfe dfblob:e77562aabba )
 OR (
bs:"Re: [PATCH] PCI/IOV: update num_VFs earlier" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191003221007.GA209602@google.com \
    --to=helgaas@kernel.org \
    --cc=alexander.h.duyck@intel.com \
    --cc=ddutile@redhat.com \
    --cc=jakub.kicinski@netronome.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=pierre.cregut@orange.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.