From: Keith Busch <keith.busch@intel.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Marta Rybczynska <mrybczyn@kalray.eu>,
	axboe@fb.com, hch@lst.de, sagi@grimberg.me,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	Pierre-Yves Kerbrat <pkerbrat@kalray.eu>
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
Date: Wed, 21 Mar 2018 10:02:39 -0600	[thread overview]
Message-ID: <20180321160238.GF12909@localhost.localdomain> (raw)
In-Reply-To: <20180321154807.GD22254@ming.t460p>

On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >> 
> > >> This causes a loop that enables all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
> > >> including __pci_set_master() and architecture-specific functions that
> > >> call functions such as pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read the PCI_COMMAND field in the PCIe
> > >> configuration space and change it. This is done as read/modify/write.
> > >> 
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch -  C - D
> > >>                \- E - F
> > >> 
> > >> D and F are two NVMe disks, and none of the devices from B onward are
> > >> enabled or have bus mastering set. If their reset works are scheduled in
> > >> parallel, the two modifications of PCI_COMMAND may happen in parallel
> > >> without locking, and the system may end up with part of the PCIe tree
> > >> not enabled.
> > > 
> > > Then it looks like serialized reset should be used, and I did see that
> > > commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the issue of
> > > 'failed to mark controller state' in the reset stress test.
> > >
> > > But that commit only covers the case of PCI reset from the sysfs
> > > attribute, and other cases may need to be dealt with in a similar way too.
> > > 
> > 
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different devices
> > that share the same PCIe tree.
> 
> Given that reset shouldn't be a frequent action, it might be fine to
> serialize all resets from different devices.
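
To make the lost update described above concrete, here is a minimal
user-space sketch. It is not kernel code: bridge_cmd is a hypothetical
stand-in for a shared bridge's PCI_COMMAND register, and both helper
functions are made up for illustration. Only the two bit values follow the
kernel's PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER definitions. The
interleaving is written out by hand so the outcome is deterministic.

```c
#include <stdint.h>

/* Bit values match the kernel's PCI_COMMAND register definitions. */
#define PCI_COMMAND_MEMORY 0x2	/* respond to memory space accesses */
#define PCI_COMMAND_MASTER 0x4	/* enable bus mastering */

/* Hypothetical stand-in for the PCI_COMMAND register of a shared bridge. */
static uint16_t bridge_cmd;

/*
 * Both tasks read the register before either one writes it back,
 * so the second write discards the bit set by the first.
 */
static uint16_t interleaved_enable(void)
{
	uint16_t cmd_d, cmd_f;

	bridge_cmd = 0;
	cmd_d = bridge_cmd;				/* reset work for D reads 0 */
	cmd_f = bridge_cmd;				/* reset work for F reads 0 */
	bridge_cmd = cmd_d | PCI_COMMAND_MEMORY;	/* D writes 0x2 */
	bridge_cmd = cmd_f | PCI_COMMAND_MASTER;	/* F writes 0x4, losing 0x2 */
	return bridge_cmd;
}

/* Each read/modify/write completes before the next one starts. */
static uint16_t serialized_enable(void)
{
	uint16_t cmd;

	bridge_cmd = 0;
	cmd = bridge_cmd;
	bridge_cmd = cmd | PCI_COMMAND_MEMORY;
	cmd = bridge_cmd;
	bridge_cmd = cmd | PCI_COMMAND_MASTER;
	return bridge_cmd;
}
```

Calling interleaved_enable() leaves only the bus-master bit set, while
serialized_enable() leaves both the memory and bus-master bits set.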

The driver was much simpler when we had serialized resets in line with
probe, but that had bigger problems with certain init systems when
you put enough nvme devices in your server, making it unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?
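
As a rough user-space analogue of that idea, the sketch below uses a
pthread mutex in place of pci_lock_rescan_remove() and
pci_unlock_rescan_remove(). Every name in it (topology_lock, shared_cmd,
set_bit_locked, enable_in_parallel) is hypothetical and only illustrates
the locking pattern; it says nothing about how the PCI core behaves.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2	/* same bit values as the kernel headers */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical global lock standing in for the rescan/remove lock. */
static pthread_mutex_t topology_lock = PTHREAD_MUTEX_INITIALIZER;
/* Hypothetical stand-in for a shared bridge's PCI_COMMAND register. */
static uint16_t shared_cmd;

/* One "reset work": a read/modify/write done entirely under the lock. */
static void *set_bit_locked(void *arg)
{
	uint16_t bit = *(const uint16_t *)arg;
	uint16_t cmd;

	pthread_mutex_lock(&topology_lock);
	cmd = shared_cmd;
	shared_cmd = cmd | bit;
	pthread_mutex_unlock(&topology_lock);
	return NULL;
}

/* Run two reset works in parallel; returns 0 and stores the final value. */
static int enable_in_parallel(uint16_t *out)
{
	uint16_t bits[2] = { PCI_COMMAND_MEMORY, PCI_COMMAND_MASTER };
	pthread_t threads[2];
	int started = 0, ret = 0, i;

	shared_cmd = 0;
	for (i = 0; i < 2; i++) {
		if (pthread_create(&threads[i], NULL, set_bit_locked, &bits[i])) {
			ret = -1;
			break;
		}
		started++;
	}
	/* Join only the threads that were actually created. */
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	*out = shared_cmd;
	return ret;
}
```

Because each read/modify/write happens with the lock held, both bits
survive no matter which thread runs first. The cost is that every caller
contends on a single global lock.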

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
	int result = -ENOMEM;
	struct pci_dev *pdev = to_pci_dev(dev->dev);

-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;

	pci_set_master(pdev);

--
