All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dexuan Cui <decui@microsoft.com>
To: bhelgaas@google.com, davem@davemloft.net, decui@microsoft.com,
	edumazet@google.com, haiyangz@microsoft.com, jakeo@microsoft.com,
	kuba@kernel.org, kw@linux.com, kys@microsoft.com,
	leon@kernel.org, linux-pci@vger.kernel.org,
	lpieralisi@kernel.org, mikelley@microsoft.com, pabeni@redhat.com,
	robh@kernel.org, saeedm@nvidia.com, wei.liu@kernel.org,
	longli@microsoft.com, boqun.feng@gmail.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock
Date: Mon, 27 Mar 2023 21:51:21 -0700	[thread overview]
Message-ID: <20230328045122.25850-6-decui@microsoft.com> (raw)
In-Reply-To: <20230328045122.25850-1-decui@microsoft.com>

In the case of fast device addition/removal, it's possible that
hv_eject_device_work() can start to run before create_root_hv_pci_bus()
starts to run; as a result, the pci_get_domain_bus_and_slot() in
hv_eject_device_work() can return a 'pdev' of NULL, and
hv_eject_device_work() can remove the 'hpdev', and immediately send a
message PCI_EJECTION_COMPLETE to the host, and the host immediately
unassigns the PCI device from the guest; meanwhile,
create_root_hv_pci_bus() and the PCI device driver can be probing the
dead PCI device and reporting timeout errors.

Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the
mutex before powering on the PCI bus in hv_pci_enter_d0(): when
hv_eject_device_work() starts to run, it's able to find the 'pdev' and call
pci_stop_and_remove_bus_device(pdev): if the PCI device driver has
loaded, the PCI device driver's probe() function is already called in
create_root_hv_pci_bus() -> pci_bus_add_devices(), and now
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able
to call the PCI device driver's remove() function and remove the device
reliably; if the PCI device driver hasn't loaded yet, the function call
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to
remove the PCI device reliably and the PCI device driver's probe()
function won't be called; if the PCI device driver's probe() is already
running (e.g., systemd-udev is loading the PCI device driver), it must
be holding the per-device lock, and after the probe() finishes and releases
the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is
able to proceed to remove the device reliably.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

With the below debug code:

Changes to /drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1727,6 +1727,8 @@ static int probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct devlink *devlink;
 	int err;

+	printk("%s: line %d: sleep 20s: \n", __func__, __LINE__); ssleep(20);
+	printk("%s: line %d: sleep 20s: done\n", __func__, __LINE__);
 	devlink = mlx5_devlink_alloc(&pdev->dev);
 	if (!devlink) {
 		dev_err(&pdev->dev, "devlink alloc failed\n");
Changes to drivers/pci/controller/pci-hyperv.c
@@ -2749,6 +2749,7 @@ static void hv_eject_device_work(struct work_struct *work)
 	 */
 	wslot = wslot_to_devfn(hpdev->desc.win_slot.slot);
 	pdev = pci_get_domain_bus_and_slot(hbus->bridge->domain_nr, 0, wslot);
+	printk("%s: 1: line %d: pdev=%px\n", __func__, __LINE__, pdev);
 	if (pdev) {
 		pci_lock_rescan_remove();
 		pci_stop_and_remove_bus_device(pdev);
@@ -2756,9 +2757,12 @@ static void hv_eject_device_work(struct work_struct *work)
 		pci_unlock_rescan_remove();
 	}

+	printk("%s: 2: line %d: pdev=%px: sleeping 10s\n", __func__, __LINE__, pdev); ssleep(10);
+	printk("%s: 3: line %d: pdev=%px: sleeping 10s: done: removing hpdev\n", __func__, __LINE__, pdev);
 	spin_lock_irqsave(&hbus->device_list_lock, flags);
 	list_del(&hpdev->list_entry);
 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+	printk("%s: 4: line %d: pdev=%px: hpdev is removed!!!\n", __func__, __LINE__, pdev);

 	if (hpdev->pci_slot)
 		pci_destroy_slot(hpdev->pci_slot);
@@ -2770,6 +2774,7 @@ static void hv_eject_device_work(struct work_struct *work)
 	vmbus_sendpacket(hbus->hdev->channel, ejct_pkt,
 			 sizeof(*ejct_pkt), 0,
 			 VM_PKT_DATA_INBAND, 0);
+	printk("%s: 5: line %d: pdev=%px: sent PCI_EJECTION_COMPLETE\n", __func__, __LINE__, pdev);

 	/* For the get_pcichild() in hv_pci_eject_device() */
 	put_pcichild(hpdev);
@@ -3686,7 +3691,10 @@ static int hv_pci_probe(struct hv_device *hdev,

 	hbus->state = hv_pcibus_probed;

+	printk("%s: line %d: create_root_hv_pci_bus=%d: sleeping 10s\n", __func__, __LINE__, ret); ssleep(10);
+	printk("%s: line %d: create_root_hv_pci_bus=%d: sleeping 10s: done\n", __func__, __LINE__, ret);
 	ret = create_root_hv_pci_bus(hbus);
+	printk("%s: line %d: create_root_hv_pci_bus=%d\n", __func__, __LINE__, ret);
 	if (ret)
 		goto free_windows;

I'm able to repro the below timeout errors (remove the PCI VF NIC within 10 seconds after it's assigned)

[   31.445306] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   31.452133] hv_pci_probe: line 3694: create_root_hv_pci_bus=0: sleeping 10s
[   34.345256] hv_eject_device_work: 1: line 2752: pdev=0000000000000000
[   34.350175] hv_eject_device_work: 2: line 2760: pdev=0000000000000000: sleeping 10s
[   41.650330] hv_pci_probe: line 3695: create_root_hv_pci_bus=0: sleeping 10s: done
[   41.668443] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[   41.674201] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
[   41.680284] pci_bus 468b:00: No busn resource found for root bus, will use [bus 00-ff]
[   41.688901] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
[   41.695554] pci 468b:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
[   41.704304] pci 468b:00:02.0: enabling Extended Tags
...
[   41.745847] hv_pci_probe: line 3697: create_root_hv_pci_bus=0
[   41.938950] mlx5_core 468b:00:02.0: no default pinctrl state
[   41.941462] probe_one: line 1730: sleep 20s:
[   44.466334] hv_eject_device_work: 3: line 2761: pdev=0000000000000000: sleeping 10s: done: removing hpdev
[   44.472691] hv_eject_device_work: 4: line 2765: pdev=0000000000000000: hpdev is removed!!!
[   44.478007] hv_eject_device_work: 5: line 2777: pdev=0000000000000000: sent PCI_EJECTION_COMPLETE
[   44.480213] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   63.154376] probe_one: line 1731: sleep 20s: done
[   63.161610] mlx5_core 468b:00:02.0: firmware version: 65535.65535.65535
[   83.174538] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 100s (0xffffffff)
[  103.190543] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 79s (0xffffffff)
[  123.202507] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 59s (0xffffffff)
[  143.214547] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 39s (0xffffffff)
[  163.222594] mlx5_core 468b:00:02.0: wait_fw_init:199:(pid 17): Waiting for FW initialization, timeout abort in 19s (0xffffffff)
[  183.178629] mlx5_core 468b:00:02.0: mlx5_function_setup:1130:(pid 17): Firmware over 120000 MS in pre-initializing state, aborting
[  183.186289] mlx5_core 468b:00:02.0: probe_one:1764:(pid 17): mlx5_init_one failed with error code -16
[  183.192623] mlx5_core: probe of 468b:00:02.0 failed with error -16
[  183.202701] pci_bus 468b:00: busn_res: [bus 00] is released

With the fix, I'm no longer seeing the timeout errors:

[   31.144066] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   31.149207] hv_pci_probe: line 3708: create_root_hv_pci_bus=0: sleeping 10s
[   41.378274] hv_pci_probe: line 3709: create_root_hv_pci_bus=0: sleeping 10s: done
[   41.397577] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[   41.402799] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
...
[   41.415586] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
[   41.423132] pci 468b:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
...
[   41.471366] hv_pci_probe: line 3711: create_root_hv_pci_bus=0
[   41.484371] hv_eject_device_work: 1: line 2761: pdev=ffff90f4c4858000
[   41.491644] hv_eject_device_work: 2: line 2769: pdev=ffff90f4c4858000: sleeping 10s
[   51.618268] hv_eject_device_work: 3: line 2770: pdev=ffff90f4c4858000: sleeping 10s: done: removing hpdev
[   51.625094] hv_eject_device_work: 4: line 2774: pdev=ffff90f4c4858000: hpdev is removed!!!
[   51.630234] hv_eject_device_work: 5: line 2786: pdev=ffff90f4c4858000: sent PCI_EJECTION_COMPLETE
[   51.632148] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   51.647014] pci_bus 468b:00: busn_res: [bus 00] is released

Now the mlx5_core driver is loaded; I try fast device addition/removal again
and I'm still not seeing the timeout errors:

[  495.622678] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[  495.627791] hv_pci_probe: line 3708: create_root_hv_pci_bus=0: sleeping 10s
[  505.762550] hv_pci_probe: line 3709: create_root_hv_pci_bus=0: sleeping 10s: done
[  505.779496] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI host bridge to bus 468b:00
[  505.784872] pci_bus 468b:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
...
[  505.798323] pci 468b:00:02.0: [15b3:1016] type 00 class 0x020000
...
[  505.868908] probe_one: line 1730: sleep 20s:
[  525.986578] probe_one: line 1731: sleep 20s: done
[  525.990717] mlx5_core 468b:00:02.0: enabling device (0000 -> 0002)
[  525.998339] mlx5_core 468b:00:02.0: firmware version: 14.25.8102
...
[  526.381211] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF registering: eth1
[  526.385432] mlx5_core 468b:00:02.0 eth1: joined to eth0
[  526.389076] mlx5_core 468b:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[  526.397276] mlx5_core 468b:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[  526.400330] mlx5_core 468b:00:02.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
...
[  526.430803] hv_pci_probe: line 3711: create_root_hv_pci_bus=0
[  526.443633] hv_eject_device_work: 1: line 2761: pdev=ffff90f4e369b000
[  526.454819] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF unregistering: enP18059s1
[  526.457828] mlx5_core 468b:00:02.0 enP18059s1 (unregistering): Disabling LRO, not supported in legacy RQ
[  527.680118] mlx5_core 468b:00:02.0: devm_attr_group_remove: removing group 00000000074d6d6b
[  527.692794] hv_eject_device_work: 2: line 2769: pdev=ffff90f4e369b000: sleeping 10s
[  537.762575] hv_eject_device_work: 3: line 2770: pdev=ffff90f4e369b000: sleeping 10s: done: removing hpdev
[  537.768831] hv_eject_device_work: 4: line 2774: pdev=ffff90f4e369b000: hpdev is removed!!!
[  537.774313] hv_eject_device_work: 5: line 2786: pdev=ffff90f4e369b000: sent PCI_EJECTION_COMPLETE
[  537.777737] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[  537.794038] pci_bus 468b:00: busn_res: [bus 00] is released

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 48feab095a14..2c0b86b20408 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -489,7 +489,10 @@ struct hv_pcibus_device {
 	struct fwnode_handle *fwnode;
 	/* Protocol version negotiated with the host */
 	enum pci_protocol_version_t protocol_version;
+
+	struct mutex state_lock;
 	enum hv_pcibus_state state;
+
 	struct hv_device *hdev;
 	resource_size_t low_mmio_space;
 	resource_size_t high_mmio_space;
@@ -2512,6 +2515,8 @@ static void pci_devices_present_work(struct work_struct *work)
 	if (!dr)
 		return;
 
+	mutex_lock(&hbus->state_lock);
+
 	/* First, mark all existing children as reported missing. */
 	spin_lock_irqsave(&hbus->device_list_lock, flags);
 	list_for_each_entry(hpdev, &hbus->children, list_entry) {
@@ -2593,6 +2598,8 @@ static void pci_devices_present_work(struct work_struct *work)
 		break;
 	}
 
+	mutex_unlock(&hbus->state_lock);
+
 	kfree(dr);
 }
 
@@ -2741,6 +2748,8 @@ static void hv_eject_device_work(struct work_struct *work)
 	hpdev = container_of(work, struct hv_pci_dev, wrk);
 	hbus = hpdev->hbus;
 
+	mutex_lock(&hbus->state_lock);
+
 	/*
 	 * Ejection can come before or after the PCI bus has been set up, so
 	 * attempt to find it and tear down the bus state, if it exists.  This
@@ -2777,6 +2786,8 @@ static void hv_eject_device_work(struct work_struct *work)
 	put_pcichild(hpdev);
 	put_pcichild(hpdev);
 	/* hpdev has been freed. Do not use it any more. */
+
+	mutex_unlock(&hbus->state_lock);
 }
 
 /**
@@ -3562,6 +3573,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 		return -ENOMEM;
 
 	hbus->bridge = bridge;
+	mutex_init(&hbus->state_lock);
 	hbus->state = hv_pcibus_init;
 	hbus->wslot_res_allocated = -1;
 
@@ -3670,9 +3682,11 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_irq_domain;
 
+	mutex_lock(&hbus->state_lock);
+
 	ret = hv_pci_enter_d0(hdev);
 	if (ret)
-		goto free_irq_domain;
+		goto release_state_lock;
 
 	ret = hv_pci_allocate_bridge_windows(hbus);
 	if (ret)
@@ -3690,12 +3704,15 @@ static int hv_pci_probe(struct hv_device *hdev,
 	if (ret)
 		goto free_windows;
 
+	mutex_unlock(&hbus->state_lock);
 	return 0;
 
 free_windows:
 	hv_pci_free_bridge_windows(hbus);
 exit_d0:
 	(void) hv_pci_bus_exit(hdev, true);
+release_state_lock:
+	mutex_unlock(&hbus->state_lock);
 free_irq_domain:
 	irq_domain_remove(hbus->irq_domain);
 free_fwnode:
@@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device *hdev)
 	if (ret)
 		goto out;
 
+	mutex_lock(&hbus->state_lock);
+
 	ret = hv_pci_enter_d0(hdev);
 	if (ret)
 		goto out;
 
 	ret = hv_send_resources_allocated(hdev);
 	if (ret)
-		goto out;
+		goto release_state_lock;
 
 	prepopulate_bars(hbus);
 
 	hv_pci_restore_msi_state(hbus);
 
 	hbus->state = hv_pcibus_installed;
+	mutex_unlock(&hbus->state_lock);
 	return 0;
+
+release_state_lock:
+	mutex_unlock(&hbus->state_lock);
 out:
 	vmbus_close(hdev->channel);
 	return ret;
-- 
2.25.1


  parent reply	other threads:[~2023-03-28  4:53 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-28  4:51 [PATCH 0/6] pci-hyper: fix race condition bugs for fast device hotplug Dexuan Cui
2023-03-28  4:51 ` [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations() Dexuan Cui
2023-03-28  5:29   ` [EXTERNAL] " Saurabh Singh Sengar
2023-03-28  5:38     ` Dexuan Cui
2023-03-28 17:24       ` Bjorn Helgaas
2023-03-29  8:37         ` Dexuan Cui
2023-03-29 16:06           ` Bjorn Helgaas
2023-03-28 16:48   ` Long Li
2023-03-29  8:21     ` Dexuan Cui
2023-03-28  4:51 ` [PATCH 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic Dexuan Cui
2023-03-28  4:51 ` [PATCH 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev Dexuan Cui
2023-03-28  4:51 ` [PATCH 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" Dexuan Cui
2023-03-28  6:33   ` Dexuan Cui
2023-03-28 12:46     ` Wei Hu
2023-03-30  5:17     ` Wei Hu
2023-03-28  4:51 ` Dexuan Cui [this message]
2023-03-29 16:41   ` [PATCH 5/6] PCI: hv: Add a per-bus mutex state_lock Michael Kelley (LINUX)
2023-03-29 16:54     ` Dexuan Cui
2023-03-28  4:51 ` [PATCH 6/6] PCI: hv: Use async probing to reduce boot time Dexuan Cui
2023-03-29 16:55   ` Michael Kelley (LINUX)
2023-03-29 22:42     ` Dexuan Cui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230328045122.25850-6-decui@microsoft.com \
    --to=decui@microsoft.com \
    --cc=bhelgaas@google.com \
    --cc=boqun.feng@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=haiyangz@microsoft.com \
    --cc=jakeo@microsoft.com \
    --cc=kuba@kernel.org \
    --cc=kw@linux.com \
    --cc=kys@microsoft.com \
    --cc=leon@kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=longli@microsoft.com \
    --cc=lpieralisi@kernel.org \
    --cc=mikelley@microsoft.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=robh@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=wei.liu@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.