linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv2 0/2] pcie hotplug and error fixes
@ 2024-06-12 18:16 Keith Busch
  2024-06-12 18:16 ` [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock Keith Busch
  2024-06-12 18:16 ` [PATCHv2 2/2] PCI: err: ensure stable topology during handling Keith Busch
  0 siblings, 2 replies; 10+ messages in thread
From: Keith Busch @ 2024-06-12 18:16 UTC (permalink / raw)
  To: linux-pci, lukas, bhelgaas; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

We' failures when dealing with certain overlapping pcie events. The
topology is essentially this:

  [Root Port] <-> [UpStream Port] <-> [DownStream Port] <-> [End Device]

An error between the DSP and ED triggers DPC. There's only inband
presence detection so it also triggers hotplug. Before the error
handling is completed, though, another error seen by the RP triggers its
own DPC handling.

The concurrent event handling reveals some interesting races, and this
patchset tries to address these without fundamentally changing the
locking scheme.

v1->v2:

  Fixed compiler error for !CONFIG_PCI (lkp)

Keith Busch (2):
  PCI: pciehp: fix concurrent sub-tree removal deadlock
  PCI: err: ensure stable topology during handling

 drivers/pci/hotplug/pciehp_pci.c | 12 +++++++++---
 drivers/pci/pci.h                |  1 +
 drivers/pci/pcie/err.c           |  8 +++++++-
 drivers/pci/probe.c              | 24 ++++++++++++++++++++++++
 include/linux/pci.h              |  2 ++
 5 files changed, 43 insertions(+), 4 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-06-12 18:16 [PATCHv2 0/2] pcie hotplug and error fixes Keith Busch
@ 2024-06-12 18:16 ` Keith Busch
  2024-06-27  7:47   ` Lukas Wunner
  2024-06-12 18:16 ` [PATCHv2 2/2] PCI: err: ensure stable topology during handling Keith Busch
  1 sibling, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-06-12 18:16 UTC (permalink / raw)
  To: linux-pci, lukas, bhelgaas; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

PCIe hotplug events modify the topology in their IRQ thread once it can
acquire the global pci_rescan_remove_lock.

If a different removal event happens to acquire that lock first, and
that removal event is for the parent device of the bridge processing the
other hotplug event, then we are deadlocked: the parent removal will
wait indefinitely on the child's IRQ thread because the parent is
holding the global lock the child thread needs to make forward progress.

Introduce a new locking function that aborts if the device is being
removed. The following are stack traces of the deadlock:

Task A:

  pciehp_unconfigure_device+0x41/0x120
  pciehp_disable_slot+0x3c/0xc0
  pciehp_handle_presence_or_link_change+0x28f/0x3e0
  pciehp_ist+0xc3/0x210
  irq_thread_fn+0x19/0x40

Task B:

  __synchronize_irq+0x5b/0x90
  free_irq+0x192/0x2e0
  pcie_shutdown_notification+0x3b/0x40
  pciehp_remove+0x23/0x50
  pcie_port_remove_service+0x2c/0x40
  device_release_driver_internal+0x11f/0x180
  bus_remove_device+0xc5/0x110
  device_del+0x126/0x340
  device_unregister+0x13/0x50
  remove_iter+0x17/0x20
  device_for_each_child+0x4a/0x70
  pcie_portdrv_remove+0x23/0x40
  pci_device_remove+0x24/0x60
  device_release_driver_internal+0x11f/0x180
  pci_stop_bus_device+0x57/0x80
  pci_stop_bus_device+0x2c/0x80
  pci_stop_and_remove_bus_device+0xe/0x20
  pciehp_unconfigure_device+0x76/0x120
  pciehp_disable_slot+0x3c/0xc0
  pciehp_handle_presence_or_link_change+0x28f/0x3e0
  pciehp_ist+0xc3/0x210
  irq_thread_fn+0x19/0x40

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/hotplug/pciehp_pci.c | 12 +++++++++---
 drivers/pci/pci.h                |  1 +
 drivers/pci/probe.c              | 24 ++++++++++++++++++++++++
 include/linux/pci.h              |  4 ++++
 4 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index ad12515a4a121..ca6237b0732c8 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -34,9 +34,12 @@ int pciehp_configure_device(struct controller *ctrl)
 	struct pci_dev *dev;
 	struct pci_dev *bridge = ctrl->pcie->port;
 	struct pci_bus *parent = bridge->subordinate;
-	int num, ret = 0;
+	int num, ret;
 
-	pci_lock_rescan_remove();
+	ret = pci_trylock_rescan_remove(bridge);
+	if (!ret)
+		return -ENODEV;
+	ret = 0;
 
 	dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
 	if (dev) {
@@ -93,6 +96,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	struct pci_dev *dev, *temp;
 	struct pci_bus *parent = ctrl->pcie->port->subordinate;
 	u16 command;
+	int ret;
 
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
@@ -100,7 +104,9 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	if (!presence)
 		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
 
-	pci_lock_rescan_remove();
+	ret = pci_trylock_rescan_remove(parent->self);
+	if (!ret)
+		return;
 
 	/*
 	 * Stopping an SR-IOV PF device removes all the associated VFs,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fd44565c47562..f525490a02122 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -370,6 +370,7 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
 	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
 	pci_doe_disconnected(dev);
+	pci_notify_disconnected();
 
 	return 0;
 }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 5fbabb4e3425f..d2e19a1d1a45b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3302,6 +3302,7 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
  * routines should always be executed under this mutex.
  */
 static DEFINE_MUTEX(pci_rescan_remove_lock);
+static DECLARE_WAIT_QUEUE_HEAD(pci_lock_wq);
 
 void pci_lock_rescan_remove(void)
 {
@@ -3309,12 +3310,35 @@ void pci_lock_rescan_remove(void)
 }
 EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);
 
+/*
+ * pci_trylock_rescan_remove() - keep trying to take the lock until successful
+ *				 or notified the device is disconnected
+ *
+ * Returns 1 if the lock was successfully taken, 0 otherwise.
+ */
+bool pci_trylock_rescan_remove(struct pci_dev *dev)
+{
+	int ret;
+
+	wait_event(pci_lock_wq,
+		   (ret = mutex_trylock(&pci_rescan_remove_lock)) == 1 ||
+		   pci_dev_is_disconnected(dev));
+
+	return ret;
+}
+
 void pci_unlock_rescan_remove(void)
 {
 	mutex_unlock(&pci_rescan_remove_lock);
+	wake_up_all(&pci_lock_wq);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);
 
+void pci_notify_disconnected(void)
+{
+	wake_up_all(&pci_lock_wq);
+}
+
 static int __init pci_sort_bf_cmp(const struct device *d_a,
 				  const struct device *d_b)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cafc5ab1cbcb4..05f293f7d8b1c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1442,7 +1442,9 @@ void set_pcie_hotplug_bridge(struct pci_dev *pdev);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
 void pci_lock_rescan_remove(void);
+bool pci_trylock_rescan_remove(struct pci_dev *dev);
 void pci_unlock_rescan_remove(void);
+void pci_notify_disconnected(void);
 
 /* Vital Product Data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
@@ -2072,6 +2074,8 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 {
 	return -ENOSPC;
 }
+
+static inline void pci_notify_disconnected(void) { }
 #endif /* CONFIG_PCI */
 
 /* Include architecture-dependent settings and functions */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCHv2 2/2] PCI: err: ensure stable topology during handling
  2024-06-12 18:16 [PATCHv2 0/2] pcie hotplug and error fixes Keith Busch
  2024-06-12 18:16 ` [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock Keith Busch
@ 2024-06-12 18:16 ` Keith Busch
  2024-06-15 14:09   ` Lukas Wunner
  1 sibling, 1 reply; 10+ messages in thread
From: Keith Busch @ 2024-06-12 18:16 UTC (permalink / raw)
  To: linux-pci, lukas, bhelgaas; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

DPC and AER handling access their subordinate bus devices. If pciehp should
happen to also trigger during this handling, it will remove all the subordinate
buses, then dereferecing any children may be a use-after-free. That may lead to
kernel panics like the below.

 BUG: unable to handle page fault for address: 00000000091400c0
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 15 PID: 2464 Comm: irq/53-pcie-dpc Kdump: loaded Tainted: G            E      6.9.0-0_fbk0_rc10_871_g4e98bf884071 #1
 RIP: 0010:pci_bus_read_config_dword+0x17/0x50
 Code: e9 0e 00 00 00 c7 01 ff ff ff ff b8 86 00 00 00 c3 cc cc 0f 1f 44 00 00 53 50 c7 44 24 04 00 00 00 00 f6 c2 03 75 27 48 89 cb <48> 8b 87 c0 00 00 00 4c 8d 44 24 04 b9 04 00 00 00 ff 50 18 85 c0
 RSP: 0018:ffffc90039113d60 EFLAGS: 00010246
 RAX: 0000000009140000 RBX: ffffc90039113d7c RCX: ffffc90039113d7c
 RDX: 0000000000000004 RSI: 0000000000000000 RDI: 0000000009140000
 RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000001
 R10: 0000000000000000 R11: 0000001f975c6971 R12: 000000000000e9fc
 R13: ffff88811b5b4000 R14: ffffc90039113d7c R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff899f7d3c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000091400c0 CR3: 00000243fb00f002 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __die+0x78/0xc0
  ? page_fault_oops+0x2a8/0x3a0
  ? sched_clock+0x5/0x10
  ? psi_task_switch+0x39/0xc90
  ? __switch_to+0x131/0x530
  ? exc_page_fault+0x63/0x130
  ? asm_exc_page_fault+0x22/0x30
  ? pci_bus_read_config_dword+0x17/0x50
  pci_dev_wait+0x107/0x190
  ? dpc_completed+0x50/0x50
  dpc_reset_link+0x4e/0xd0
  pcie_do_recovery+0xb2/0x2d0
  ? irq_forced_thread_fn+0x60/0x60
  dpc_handler+0x107/0x130
  irq_thread_fn+0x19/0x40
  irq_thread+0x120/0x1e0
  ? irq_thread_fn+0x40/0x40
  ? irq_forced_secondary_handler+0x20/0x20
  kthread+0xae/0xe0
  ? file_tty_write+0x360/0x360
  ret_from_fork+0x2f/0x40
  ? file_tty_write+0x360/0x360
  ret_from_fork_asm+0x11/0x20
  </TASK>

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/err.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffcc..5355fc0fbf910 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -192,7 +192,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pci_channel_state_t state,
 		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
 {
-	int type = pci_pcie_type(dev);
+	int type = pci_pcie_type(dev), ret;
 	struct pci_dev *bridge;
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
@@ -214,6 +214,10 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	else
 		bridge = pci_upstream_bridge(dev);
 
+
+	ret = pci_trylock_rescan_remove(bridge);
+	if (!ret)
+		return PCI_ERS_RESULT_DISCONNECT;
 	pci_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
 
 	pci_dbg(bridge, "broadcast error_detected message\n");
@@ -262,12 +266,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	}
 
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+	pci_unlock_rescan_remove();
 
 	pci_info(bridge, "device recovery successful\n");
 	return status;
 
 failed:
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+	pci_unlock_rescan_remove();
 
 	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 2/2] PCI: err: ensure stable topology during handling
  2024-06-12 18:16 ` [PATCHv2 2/2] PCI: err: ensure stable topology during handling Keith Busch
@ 2024-06-15 14:09   ` Lukas Wunner
  2024-06-17 17:04     ` Keith Busch
  0 siblings, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2024-06-15 14:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, bhelgaas, Keith Busch

On Wed, Jun 12, 2024 at 11:16:25AM -0700, Keith Busch wrote:
> DPC and AER handling access their subordinate bus devices. If pciehp
> should happen to also trigger during this handling, it will remove
> all the subordinate buses, then dereferecing any children may be a
> use-after-free. That may lead to kernel panics like the below.

I assume the crash occurs because the struct pci_bus accessed by
pci_bus_read_config_dword() has already been freed?

Generally the solution to issues like this is to hold references
on the structs being accessed, not to acquire locks that happen
to prevent the structs from being freed.

Question is, which struct ref needs to be held and where?

Holding a ref on a struct pci_dev also holds the pci_bus it resides on
in place.  So I suspect we need to call pci_dev_get() somewhere.

The stack trace looks incomplete for some reason:

>   ? pci_bus_read_config_dword+0x17/0x50
>   pci_dev_wait+0x107/0x190
>   ? dpc_completed+0x50/0x50
>   dpc_reset_link+0x4e/0xd0
>   pcie_do_recovery+0xb2/0x2d0

I'd expect a call to pci_bridge_wait_for_secondary_bus() from
dpc_reset_link(), which in turn calls pci_dev_wait().

Indeed pci_bridge_wait_for_secondary_bus() does something fishy:
It takes the first entry from the devices list without acquiring
a ref:

	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
				 bus_list);

Below is a small patch which acquires a ref on child.  Maybe this
already does the trick?

-- >8 --

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2a8063e..82db9a8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4753,7 +4753,7 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
  */
 int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
 {
-	struct pci_dev *child;
+	struct pci_dev *child __free(pci_dev_put) = NULL;
 	int delay;
 
 	if (pci_dev_is_disconnected(dev))
@@ -4782,8 +4782,9 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
 		return 0;
 	}
 
-	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
-				 bus_list);
+
+	child = pci_dev_get(list_first_entry(&dev->subordinate->devices,
+					     struct pci_dev, bus_list));
 	up_read(&pci_bus_sem);
 
 	/*

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 2/2] PCI: err: ensure stable topology during handling
  2024-06-15 14:09   ` Lukas Wunner
@ 2024-06-17 17:04     ` Keith Busch
  0 siblings, 0 replies; 10+ messages in thread
From: Keith Busch @ 2024-06-17 17:04 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Keith Busch, linux-pci, bhelgaas

On Sat, Jun 15, 2024 at 04:09:41PM +0200, Lukas Wunner wrote:
> Below is a small patch which acquires a ref on child.  Maybe this
> already does the trick?

Thanks, it appears your idea works! I've run it through 100 test
iterations successfully; previously it would reliably panic within 5 or
fewer iterations.

I also need the first patch from this series, though, otherwise it
inevitably deadlocks. I'd be interested to hear if you have any thoughts
on that one.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Keith Busch <kbusch@kernel.org>

> -- >8 --
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2a8063e..82db9a8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4753,7 +4753,7 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
>   */
>  int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
>  {
> -	struct pci_dev *child;
> +	struct pci_dev *child __free(pci_dev_put) = NULL;
>  	int delay;
>  
>  	if (pci_dev_is_disconnected(dev))
> @@ -4782,8 +4782,9 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
>  		return 0;
>  	}
>  
> -	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
> -				 bus_list);
> +
> +	child = pci_dev_get(list_first_entry(&dev->subordinate->devices,
> +					     struct pci_dev, bus_list));
>  	up_read(&pci_bus_sem);
>  
>  	/*

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-06-12 18:16 ` [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock Keith Busch
@ 2024-06-27  7:47   ` Lukas Wunner
  2024-06-27 14:55     ` Keith Busch
  2024-11-11  7:38     ` Lukas Wunner
  0 siblings, 2 replies; 10+ messages in thread
From: Lukas Wunner @ 2024-06-27  7:47 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, bhelgaas, Keith Busch

On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> PCIe hotplug events modify the topology in their IRQ thread once it can
> acquire the global pci_rescan_remove_lock.
> 
> If a different removal event happens to acquire that lock first, and
> that removal event is for the parent device of the bridge processing the
> other hotplug event, then we are deadlocked: the parent removal will
> wait indefinitely on the child's IRQ thread because the parent is
> holding the global lock the child thread needs to make forward progress.

Yes, that's a known problem.  I submitted a fix for it in 2018:

https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/

The patch I proposed was similar to yours, but was smaller and
confined to pciehp_pci.c.  It was part of a larger series and
when respinning that series I dropped the patch, which is the
reason it never got applied.  I explained the rationale for
dropping it in this message:

https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/

Basically all these proposals (both mine and yours) are not great
because they add another layer of duct tape without tackling the
underlying problem -- that pci_lock_rescan_remove() is way too
coarse-grained and needs to be replaced by finer-grained locking.
That however is a complex task that we haven't made significant
forward progress on in the last couple of years.  Something else
always seemed more important.

So I don't really know what to say.  I recognize it's a problem
but I'm hesitant to endorse a duct tape fix. :(

In the second link above I mentioned that my approach doesn't solve
another race discovered by Xiongfeng Wang.  pciehp has been refactored
considerably since then, so I'm not sure if that particular issue
still exists...

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-06-27  7:47   ` Lukas Wunner
@ 2024-06-27 14:55     ` Keith Busch
  2024-11-11  7:38     ` Lukas Wunner
  1 sibling, 0 replies; 10+ messages in thread
From: Keith Busch @ 2024-06-27 14:55 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Keith Busch, linux-pci, bhelgaas

On Thu, Jun 27, 2024 at 09:47:05AM +0200, Lukas Wunner wrote:
> On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> > PCIe hotplug events modify the topology in their IRQ thread once it can
> > acquire the global pci_rescan_remove_lock.
> > 
> > If a different removal event happens to acquire that lock first, and
> > that removal event is for the parent device of the bridge processing the
> > other hotplug event, then we are deadlocked: the parent removal will
> > wait indefinitely on the child's IRQ thread because the parent is
> > holding the global lock the child thread needs to make forward progress.
> 
> Yes, that's a known problem.  I submitted a fix for it in 2018:
> 
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
> 
> The patch I proposed was similar to yours, but was smaller and
> confined to pciehp_pci.c.  It was part of a larger series and
> when respinning that series I dropped the patch, which is the
> reason it never got applied.  I explained the rationale for
> dropping it in this message:
> 
> https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/
> 
> Basically all these proposals (both mine and yours) are not great
> because they add another layer of duct tape without tackling the
> underlying problem -- that pci_lock_rescan_remove() is way too
> coarse-grained and needs to be replaced by finer-grained locking.
> That however is a complex task that we haven't made significant
> forward progress on in the last couple of years.  Something else
> always seemed more important.
> 
> So I don't really know what to say.  I recognize it's a problem
> but I'm hesitant to endorse a duct tape fix. :(
> 
> In the second link above I mentioned that my approach doesn't solve
> another race discovered by Xiongfeng Wang.  pciehp has been refactored
> considerably since then, so I'm not sure if that particular issue
> still exists...

Thanks for comments. I agree the current locking scheme is the real
problem here. But I would still selfishly take a duct tape solution at
this point! :) There's so many direct pci_lock_rescan_remove() users
with hardware I don't have, it may be difficult to properly test a
significant change to the locking.

The sysfs race appears to still exist today. My patch wouldn't help with
that either.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-06-27  7:47   ` Lukas Wunner
  2024-06-27 14:55     ` Keith Busch
@ 2024-11-11  7:38     ` Lukas Wunner
  2024-11-11  8:00       ` Lukas Wunner
  1 sibling, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2024-11-11  7:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, bhelgaas, Keith Busch

On Thu, Jun 27, 2024 at 09:47:05AM +0200, Lukas Wunner wrote:
> On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> > PCIe hotplug events modify the topology in their IRQ thread once it can
> > acquire the global pci_rescan_remove_lock.
> > 
> > If a different removal event happens to acquire that lock first, and
> > that removal event is for the parent device of the bridge processing the
> > other hotplug event, then we are deadlocked: the parent removal will
> > wait indefinitely on the child's IRQ thread because the parent is
> > holding the global lock the child thread needs to make forward progress.
> 
> Yes, that's a known problem.  I submitted a fix for it in 2018:
> 
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
> 
> The patch I proposed was similar to yours, but was smaller and
> confined to pciehp_pci.c.  It was part of a larger series and
> when respinning that series I dropped the patch, which is the
> reason it never got applied.  I explained the rationale for
> dropping it in this message:
> 
> https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@wunner.de/
> 
> Basically all these proposals (both mine and yours) are not great
> because they add another layer of duct tape without tackling the
> underlying problem -- that pci_lock_rescan_remove() is way too
> coarse-grained and needs to be replaced by finer-grained locking.
> That however is a complex task that we haven't made significant
> forward progress on in the last couple of years.  Something else
> always seemed more important.

Thinking about this some more:

The problem is pci_lock_rescan_remove() is a single global lock.

What if we introduce a lock at each bridge or for each pci_bus.
Before a portion of the hierarchy is removed, all locks in that
sub-hierarchy are acquired bottom-up.

I think that should avoid the deadlock.  Thoughts?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-11-11  7:38     ` Lukas Wunner
@ 2024-11-11  8:00       ` Lukas Wunner
  2024-11-11 17:59         ` Keith Busch
  0 siblings, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2024-11-11  8:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, bhelgaas, Keith Busch

On Mon, Nov 11, 2024 at 08:38:03AM +0100, Lukas Wunner wrote:
> Thinking about this some more:
> 
> The problem is pci_lock_rescan_remove() is a single global lock.
> 
> What if we introduce a lock at each bridge or for each pci_bus.
> Before a portion of the hierarchy is removed, all locks in that
> sub-hierarchy are acquired bottom-up.
> 
> I think that should avoid the deadlock.  Thoughts?

I note that you attempted something similar back in July:

https://lore.kernel.org/all/20240722151936.1452299-9-kbusch@meta.com/

However I'd suggest to solve this differently:

Keep the pci_lock_rescan_remove() everywhere, don't add pci_lock_bus()
adjacent to it.

Instead, amend pci_lock_rescan_remove() to walk the sub-hierarchy
bottom-up and acquire all the bus locks.  Obviously you'll have to amend
pci_lock_rescan_remove() to accept a pci_dev which is the bridge atop
the sub-hierarchy.  (Or alternatively, the top-most pci_bus in the
sub-hierarchy.)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock
  2024-11-11  8:00       ` Lukas Wunner
@ 2024-11-11 17:59         ` Keith Busch
  0 siblings, 0 replies; 10+ messages in thread
From: Keith Busch @ 2024-11-11 17:59 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Keith Busch, linux-pci, bhelgaas

On Mon, Nov 11, 2024 at 09:00:18AM +0100, Lukas Wunner wrote:
> On Mon, Nov 11, 2024 at 08:38:03AM +0100, Lukas Wunner wrote:
> > Thinking about this some more:
> > 
> > The problem is pci_lock_rescan_remove() is a single global lock.
> > 
> > What if we introduce a lock at each bridge or for each pci_bus.
> > Before a portion of the hierarchy is removed, all locks in that
> > sub-hierarchy are acquired bottom-up.
> > 
> > I think that should avoid the deadlock.  Thoughts?
> 
> I note that you attempted something similar back in July:
> 
> https://lore.kernel.org/all/20240722151936.1452299-9-kbusch@meta.com/
> 
> However I'd suggest to solve this differently:
> 
> Keep the pci_lock_rescan_remove() everywhere, don't add pci_lock_bus()
> adjacent to it.
> 
> Instead, amend pci_lock_rescan_remove() to walk the sub-hierarchy
> bottom-up and acquire all the bus locks.  Obviously you'll have to amend
> pci_lock_rescan_remove() to accept a pci_dev which is the bridge atop
> the sub-hierarchy.  (Or alternatively, the top-most pci_bus in the
> sub-hierarchy.)

I don't think we can walk the bus bottom-up without hitting the same
deadlock I'm trying to fix.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-11-11 17:59 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-12 18:16 [PATCHv2 0/2] pcie hotplug and error fixes Keith Busch
2024-06-12 18:16 ` [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock Keith Busch
2024-06-27  7:47   ` Lukas Wunner
2024-06-27 14:55     ` Keith Busch
2024-11-11  7:38     ` Lukas Wunner
2024-11-11  8:00       ` Lukas Wunner
2024-11-11 17:59         ` Keith Busch
2024-06-12 18:16 ` [PATCHv2 2/2] PCI: err: ensure stable topology during handling Keith Busch
2024-06-15 14:09   ` Lukas Wunner
2024-06-17 17:04     ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).