public inbox for linux-acpi@vger.kernel.org
* [PATCH] ACPI/PCI: deduplicate concurrent eject notify
@ 2025-02-24  7:00 Weiwen Hu
  2025-02-28 17:16 ` Bjorn Helgaas
  0 siblings, 1 reply; 4+ messages in thread
From: Weiwen Hu @ 2025-02-24  7:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Bjorn Helgaas; +Cc: Weiwen Hu, linux-acpi, linux-pci

Ignore an eject notification that arrives while a previous ejection is
still in progress, to prevent multiple _EJ0 invocations once the
ejection completes.

The first _EJ0 tells the platform to actually eject the device, which
frees the slot for other devices.  A subsequent _EJ0 may therefore
accidentally eject a new device that has just been plugged into the same
slot.  We need to avoid this.

For each acpi_device, this patch introduces a new field `ejecting`, which
is set before the hotplug work is queued on `kacpi_hotplug_wq` and reset
just before _EJ0 is invoked.  Every notification received before _EJ0 is
invoked targets the old device, so this ensures we do not miss any
notification for a newly plugged device.  A new flag `should_dedup_eject`
allows each driver to adopt this feature gradually: a driver should set
this flag at initialization if it resets `ejecting`.
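
The lifecycle of the `ejecting` flag can be sketched as a stand-alone
model with C11 atomics (hypothetical function names; the kernel code uses
atomic_t with atomic_xchg()/atomic_set() on struct acpi_hotplug_context):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One flag per device context; 1 while an eject is queued or running. */
static atomic_int ejecting;

/* Models acpi_hotplug_schedule() receiving ACPI_NOTIFY_EJECT_REQUEST. */
static bool schedule_eject(void)
{
	/* atomic_exchange() returns the old value: if it was already 1,
	 * an eject for this device is in flight and this notification
	 * is a duplicate that must be dropped. */
	if (atomic_exchange(&ejecting, 1))
		return false;	/* duplicate: ignore */
	return true;		/* first request: queue hotplug work */
}

/* Models the hotplug worker just before it evaluates _EJ0. */
static void begin_ej0(void)
{
	/* Reset before _EJ0: from this point on the slot may hold a new
	 * device, so later notifications must be processed again. */
	atomic_store(&ejecting, 0);
	/* ... evaluate _EJ0 ... */
}
```

A retry notification arriving between schedule_eject() and begin_ej0() is
dropped; one arriving after begin_ej0() is treated as possibly targeting
a new device in the slot.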

This fix is not perfect. If we receive an eject notification just after
resetting `ejecting` but before _EJ0 runs, we will still invoke _EJ0
twice.  However, this seems to be the best solution available, and it
strictly improves on the current situation.

Another potential fix is to add an `ejected` flag to each device and skip
_EJ0 for devices that have already been ejected. However, this approach
risks losing synchronization with the platform if something else goes
wrong, potentially preventing the device from ever being ejected.  We
would also need to check with the bus driver that the device was really
ejected successfully, and that check is itself racy, so we cannot rule
out extra _EJ0 invocations either.
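
The rejected alternative can be sketched the same way (a hypothetical
model, not proposed code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Rejected approach: a one-way "ejected" flag per device. */
static atomic_int ejected;

static bool try_ej0(void)
{
	/* Once set, the flag is never cleared.  If the first _EJ0 did
	 * not actually remove the device (e.g. a firmware error), the
	 * OS and the platform lose sync and the device can never be
	 * ejected again -- the risk described above. */
	if (atomic_exchange(&ejected, 1))
		return false;	/* assume already ejected: skip _EJ0 */
	/* ... evaluate _EJ0 ... */
	return true;
}
```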

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
---
 drivers/acpi/osl.c                 |  6 ++++++
 drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
 include/acpi/acpi_bus.h            |  4 +++-
 3 files changed, 20 insertions(+), 5 deletions(-)

We observed that umount can take an extremely long time when there is a
lot of dirty page cache.  umount takes the s_umount semaphore, which
blocks the eject work:

	__schedule+0x1e0/0x630
	? kernfs_put.part.0+0xd4/0x1a0
	schedule+0x46/0xb0
	rwsem_down_read_slowpath+0x16b/0x490
	__get_super.part.0+0xc1/0xe0
	fsync_bdev+0x11/0x60
	invalidate_partition+0x5c/0xa0
	del_gendisk+0x103/0x2e0
	virtblk_remove+0x27/0xa0
	virtio_dev_remove+0x36/0x90
	__device_release_driver+0x172/0x260
	device_release_driver+0x24/0x30
	bus_remove_device+0xf6/0x170
	device_del+0x19b/0x450
	device_unregister+0x16/0x60
	unregister_virtio_device+0x11/0x20
	virtio_pci_remove+0x31/0x60
	pci_device_remove+0x38/0xa0
	__device_release_driver+0x172/0x260
	device_release_driver+0x24/0x30
	pci_stop_bus_device+0x6c/0x90
	pci_stop_and_remove_bus_device+0xe/0x20
	disable_slot+0x49/0x90
	acpiphp_disable_and_eject_slot+0x15/0x90

While the OS has not yet invoked _EJ0, the user (or hypervisor) may retry
by issuing another notification, which is queued on kacpi_hotplug_wq.
After the umount finally completes, _EJ0 is invoked.  Then, if there are
devices pending attach, the hypervisor may immediately attach one to the
same slot, and the queued eject work can eject the new device
unintentionally.

On Alibaba Cloud, we re-issue the notification about every 10s if the OS
does not respond.  (BTW, do you think the platform is allowed to re-issue
the notification on timeout?)
We can easily reproduce this issue on Alibaba Cloud ECS:

	WRITE_SIZE=2300M  # tune this value so that the umount takes 20s
	# replace these disk serial numbers
	DISK_DETACH=bp142xxxxxxxxxxxxxxx  # pre-formatted
	DISK_ATTACH=bp109xxxxxxxxxxxxxxx  # any
	# start from these two disks detached

	INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
	echo "instance id: $INSTANCE_ID"
	DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_

	echo "attaching disk d-$DISK_DETACH"
	aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"

	sleep 2
	mkdir -p /mnt/slow
	mount "$DISK_PATH$DISK_DETACH" /mnt/slow
	echo "mounted d-$DISK_DETACH to /mnt/slow"

	rm -f /mnt/slow/zero
	echo "populating dirty cache"
	head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;

	echo umounting
	(
		umount /mnt/slow
		echo umounted
	)&

	sleep 2
	echo "detaching disk d-$DISK_DETACH"
	aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"

	sleep 10
	echo "attaching disk d-$DISK_ATTACH"
	aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"

	sleep 7
	wait
	for _ in {1..10}; do
		sleep 1
		if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
			echo "disk d-$DISK_ATTACH attached, issue not reproduced"
			exit 0
		fi
		echo "disk d-$DISK_ATTACH not found yet"
	done

	echo "issue reproduced"
	exit 1

Here is the trace we captured with `perf trace` while running the above
script on an unpatched kernel:

	[starting detach]
	 48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
	 48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
	 48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
	[blocked, retrying detach]
	 58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
	 58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
	[detach done]
	 62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
	[another device attaching]
	 62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
	 62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
	 63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
	[the new device is ejected unintentionally]
	 63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
	[the actual attach task, scanned the bus but got nothing]
	 63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)

With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
suppress the acpi_evaluate_ej0 at 63042.520.

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 5ff343096ece..f041c4db10f7 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
 {
 	struct acpi_hp_work *hpw;
 
+	if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
+			&& atomic_xchg(&adev->hp->ejecting, 1)) {
+		put_device(&adev->dev);
+		return AE_OK;
+	}
+
 	acpi_handle_debug(adev->handle,
 			  "Scheduling hotplug event %u for deferred handling\n",
 			   src);
diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
index 5b1f271c6034..3c50f2af1584 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
 	context->hp.notify = acpiphp_hotplug_notify;
 	context->hp.fixup = acpiphp_post_dock_fixup;
 	acpi_set_hp_context(adev, &context->hp);
+	adev->flags.should_dedup_eject = true;
 	return context;
 }
 
@@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
 	}
 }
 
-static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
+static int
+acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
 
 static void hotplug_event(u32 type, struct acpiphp_context *context)
 {
@@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
 	case ACPI_NOTIFY_EJECT_REQUEST:
 		/* request device eject */
 		acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
-		acpiphp_disable_and_eject_slot(slot);
+		acpiphp_disable_and_eject_slot(&context->hp, slot);
 		break;
 	}
 
@@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
 
 /**
  * acpiphp_disable_and_eject_slot - power off and eject slot
+ * @hp: the context that received eject notification, can be NULL
  * @slot: ACPI PHP slot
  */
-static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+static int
+acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
 {
 	struct acpiphp_func *func;
 
@@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
 	/* unconfigure all functions */
 	disable_slot(slot);
 
+	if (hp)
+		atomic_set(&hp->ejecting, 0);
+
 	list_for_each_entry(func, &slot->funcs, sibling)
 		if (func->flags & FUNC_HAS_EJ0) {
 			acpi_handle handle = func_to_handle(func);
@@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
 	 */
 	acpi_scan_lock_acquire();
 	pci_lock_rescan_remove();
-	ret = acpiphp_disable_and_eject_slot(slot);
+	ret = acpiphp_disable_and_eject_slot(NULL, slot);
 	pci_unlock_rescan_remove();
 	acpi_scan_lock_release();
 	return ret;
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index aad1a95e6863..870c1ffe47c9 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
 
 struct acpi_hotplug_context {
 	struct acpi_device *self;
+	atomic_t ejecting;
 	acpi_hp_notify notify;
 	acpi_hp_uevent uevent;
 	acpi_hp_fixup fixup;
@@ -215,7 +216,8 @@ struct acpi_device_flags {
 	u32 cca_seen:1;
 	u32 enumeration_by_parent:1;
 	u32 honor_deps:1;
-	u32 reserved:18;
+	u32 should_dedup_eject:1;
+	u32 reserved:17;
 };
 
 /* File System */
-- 
2.48.1



* Re: [PATCH] ACPI/PCI: deduplicate concurrent eject notify
  2025-02-24  7:00 [PATCH] ACPI/PCI: deduplicate concurrent eject notify Weiwen Hu
@ 2025-02-28 17:16 ` Bjorn Helgaas
  2025-04-11  9:12   ` Weiwen Hu
  0 siblings, 1 reply; 4+ messages in thread
From: Bjorn Helgaas @ 2025-02-28 17:16 UTC (permalink / raw)
  To: Weiwen Hu; +Cc: Rafael J. Wysocki, Bjorn Helgaas, linux-acpi, linux-pci

On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> Ignore the eject notification when the previous ejection is still in
> progress to prevent multiple _EJ0 invocations when the ejection completes.
> 
> The first _EJ0 informs the platform to actually eject the device and frees
> the slot for other devices. So the subsequent _EJ0 may accidentally eject
> the new device that is just plugged into this slot. We need to avoid this.
> 
> For each acpi_device, this patch introduces a new field `ejecting`, which
> is set before enqueuing the `kacpi_hotplug_wq` and reset before invoking
> _EJ0.  Every notifications we received before invoking _EJ0 is targeted to
> the old device. This ensures we don't miss any notifications for newly
> plugged device.  And a new flag `should_dedup_eject` allows each driver to
> implement this feature gradually. A driver should set this flag on
> initialization if it will reset `ejecting`.

Which drivers do you have in mind when you say "each driver can
implement this feature gradually"?  You set should_dedup_eject in
acpiphp, so I guess you mean other ACPI hotplug drivers?  I see
acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
those are the only other current ones?

> This fix is not perfect. If we receive an eject notification just after
> resetting `ejecting` but before _EJ0, we will still invoke _EJ0 twice.
> However this seems to be the best solution available, and it strictly
> improves the current situation.
> 
> Another potential fix is to add an `ejected` flag to each device and not
> invoke _EJ0 for already ejected devices. However, this approach risks
> losing synchronization with the platform if something else goes wrong,
> potentially preventing the device from being ejected permanently.  And we
> need to check with bus driver to make sure the device is really ejected
> successfully. But this check is still racy. So we cannot ensure no extra
> _EJ0 invocations either.

I see the problem.  Thanks for the detailed explanation and details
about reproducing it and the trace.

I'm not sure whether the platform should reissue the Bus Check
notification based on the fact that the OS hasn't invoked _EJ0 in some
arbitrary time.  That seems a little bit presumptuous because, for
example, the platform can't know how long it will take to write out
the dirty page cache.  The raciness of the workaround seems
troublesome to me.

But this is all really an ACPI issue, not a PCI issue, so I'd like to
defer to the ACPI experts here.

> Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
> ---
>  drivers/acpi/osl.c                 |  6 ++++++
>  drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
>  include/acpi/acpi_bus.h            |  4 +++-
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> We observed that umount can take extremely long time if there is a lot of
> dirty page cache.  umount will take the s_umount semaphore, which will
> block the ejecting process:
> 
> 	__schedule+0x1e0/0x630
> 	? kernfs_put.part.0+0xd4/0x1a0
> 	schedule+0x46/0xb0
> 	rwsem_down_read_slowpath+0x16b/0x490
> 	__get_super.part.0+0xc1/0xe0
> 	fsync_bdev+0x11/0x60
> 	invalidate_partition+0x5c/0xa0
> 	del_gendisk+0x103/0x2e0
> 	virtblk_remove+0x27/0xa0
> 	virtio_dev_remove+0x36/0x90
> 	__device_release_driver+0x172/0x260
> 	device_release_driver+0x24/0x30
> 	bus_remove_device+0xf6/0x170
> 	device_del+0x19b/0x450
> 	device_unregister+0x16/0x60
> 	unregister_virtio_device+0x11/0x20
> 	virtio_pci_remove+0x31/0x60
> 	pci_device_remove+0x38/0xa0
> 	__device_release_driver+0x172/0x260
> 	device_release_driver+0x24/0x30
> 	pci_stop_bus_device+0x6c/0x90
> 	pci_stop_and_remove_bus_device+0xe/0x20
> 	disable_slot+0x49/0x90
> 	acpiphp_disable_and_eject_slot+0x15/0x90
> 
> While OS is not invoking _EJ0 timely, the user (or hypervisor) may retry by
> issuing another notification, which will be queued in kacpi_hotplug_wq.
> After the umount is finally done, the _EJ0 will be invoked.  Then, if there
> are devices pending attach, the hypervisor may choose to attach it
> immediately to the same slot.  The new device can be ejected by the queued
> ejecting process unintentionally.
> 
> On Alibaba Cloud, we re-issue the notification around every 10s if the OS
> does not respond.  (BTW, do you think platform is allowed to re-issue
> the notification on timeout?)
> We can easily reproduce this issue on Alibaba Cloud ECS:
> 
> 	WRITE_SIZE=2300M  # tune this value so that the umount takes 20s
> 	# replace these disk serial numbers
> 	DISK_DETACH=bp142xxxxxxxxxxxxxxx  # pre-formatted
> 	DISK_ATTACH=bp109xxxxxxxxxxxxxxx  # any
> 	# start from these two disks detached
> 
> 	INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> 	echo "instance id: $INSTANCE_ID"
> 	DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> 
> 	echo "attaching disk d-$DISK_DETACH"
> 	aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> 
> 	sleep 2
> 	mkdir -p /mnt/slow
> 	mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> 	echo "mounted d-$DISK_DETACH to /mnt/slow"
> 
> 	rm -f /mnt/slow/zero
> 	echo "populating dirty cache"
> 	head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;
> 
> 	echo umounting
> 	(
> 		umount /mnt/slow
> 		echo umounted
> 	)&
> 
> 	sleep 2
> 	echo "detaching disk d-$DISK_DETACH"
> 	aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> 
> 	sleep 10
> 	echo "attaching disk d-$DISK_ATTACH"
> 	aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> 
> 	sleep 7
> 	wait
> 	for _ in {1..10}; do
> 		sleep 1
> 		if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> 			echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> 			exit 0
> 		fi
> 		echo "disk d-$DISK_ATTACH not found yet"
> 	done
> 
> 	echo "issue reproduced"
> 	exit 1
> 
> And here is the trace we got from `perf trace` while running the above script on an unpatched kernel:
> 
> 	[starting detach]
> 	 48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> 	 48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> 	 48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> 	[blocked, retrying detach]
> 	 58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> 	 58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> 	[detach done]
> 	 62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> 	[another device attaching]
> 	 62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> 	 62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> 	 63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> 	[the new device is ejected unintentionally]
> 	 63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> 	[the actual attach task, scanned the bus but got nothing]
> 	 63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> 
> With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
> suppress the acpi_evaluate_ej0 at 63042.520.
> 
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 5ff343096ece..f041c4db10f7 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
>  {
>  	struct acpi_hp_work *hpw;
>  
> +	if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> +			&& atomic_xchg(&adev->hp->ejecting, 1)) {
> +		put_device(&adev->dev);
> +		return AE_OK;
> +	}
> +
>  	acpi_handle_debug(adev->handle,
>  			  "Scheduling hotplug event %u for deferred handling\n",
>  			   src);
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index 5b1f271c6034..3c50f2af1584 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
>  	context->hp.notify = acpiphp_hotplug_notify;
>  	context->hp.fixup = acpiphp_post_dock_fixup;
>  	acpi_set_hp_context(adev, &context->hp);
> +	adev->flags.should_dedup_eject = true;
>  	return context;
>  }
>  
> @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
>  	}
>  }
>  
> -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> +static int
> +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
>  
>  static void hotplug_event(u32 type, struct acpiphp_context *context)
>  {
> @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
>  	case ACPI_NOTIFY_EJECT_REQUEST:
>  		/* request device eject */
>  		acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> -		acpiphp_disable_and_eject_slot(slot);
> +		acpiphp_disable_and_eject_slot(&context->hp, slot);
>  		break;
>  	}
>  
> @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
>  
>  /**
>   * acpiphp_disable_and_eject_slot - power off and eject slot
> + * @hp: the context that received eject notification, can be NULL
>   * @slot: ACPI PHP slot
>   */
> -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> +static int
> +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
>  {
>  	struct acpiphp_func *func;
>  
> @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
>  	/* unconfigure all functions */
>  	disable_slot(slot);
>  
> +	if (hp)
> +		atomic_set(&hp->ejecting, 0);
> +
>  	list_for_each_entry(func, &slot->funcs, sibling)
>  		if (func->flags & FUNC_HAS_EJ0) {
>  			acpi_handle handle = func_to_handle(func);
> @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
>  	 */
>  	acpi_scan_lock_acquire();
>  	pci_lock_rescan_remove();
> -	ret = acpiphp_disable_and_eject_slot(slot);
> +	ret = acpiphp_disable_and_eject_slot(NULL, slot);
>  	pci_unlock_rescan_remove();
>  	acpi_scan_lock_release();
>  	return ret;
> diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> index aad1a95e6863..870c1ffe47c9 100644
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
>  
>  struct acpi_hotplug_context {
>  	struct acpi_device *self;
> +	atomic_t ejecting;
>  	acpi_hp_notify notify;
>  	acpi_hp_uevent uevent;
>  	acpi_hp_fixup fixup;
> @@ -215,7 +216,8 @@ struct acpi_device_flags {
>  	u32 cca_seen:1;
>  	u32 enumeration_by_parent:1;
>  	u32 honor_deps:1;
> -	u32 reserved:18;
> +	u32 should_dedup_eject:1;
> +	u32 reserved:17;
>  };
>  
>  /* File System */
> -- 
> 2.48.1
> 


* Re: [PATCH] ACPI/PCI: deduplicate concurrent eject notify
  2025-02-28 17:16 ` Bjorn Helgaas
@ 2025-04-11  9:12   ` Weiwen Hu
  2025-04-11 10:39     ` Rafael J. Wysocki
  0 siblings, 1 reply; 4+ messages in thread
From: Weiwen Hu @ 2025-04-11  9:12 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Bjorn Helgaas, linux-acpi, linux-pci

Hi Rafael, what's your opinion on this race condition in ACPI hotplug?

On Fri, Feb 28, 2025 at 11:16:01AM -0600, Bjorn Helgaas wrote:
> On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> > Ignore the eject notification when the previous ejection is still in
> > progress to prevent multiple _EJ0 invocations when the ejection completes.
> > 
> > The first _EJ0 informs the platform to actually eject the device and frees
> > the slot for other devices. So the subsequent _EJ0 may accidentally eject
> > the new device that is just plugged into this slot. We need to avoid this.
> > 
> > For each acpi_device, this patch introduces a new field `ejecting`, which
> > is set before enqueuing the `kacpi_hotplug_wq` and reset before invoking
> > _EJ0.  Every notifications we received before invoking _EJ0 is targeted to
> > the old device. This ensures we don't miss any notifications for newly
> > plugged device.  And a new flag `should_dedup_eject` allows each driver to
> > implement this feature gradually. A driver should set this flag on
> > initialization if it will reset `ejecting`.
> 
> Which drivers do you have in mind when you say "each driver can
> implement this feature gradually"?  You set should_dedup_eject in
> acpiphp, so I guess you mean other ACPI hotplug drivers?  I see
> acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
> those are the only other current ones?

Yes, I mean other ACPI hotplug drivers. I searched for acpi_evaluate_ej0(),
since `ejecting` should be reset just before it, and found dock.c and
scan.c.  All of them are called from acpi_device_hotplug(), so maybe I can
change all of them at once.  Should we also consider out-of-tree drivers?

As for surface3-wmi.c, its s3_wmi_hp_notify() function only seems to probe
the lid state and does not actually support hot unplug.

ata_acpi_detach_device() does not seem to call acpi_evaluate_ej0(), though.
From f730ae183863 ("libata: remove functions now handed by ACPI dock
driver"), it seems that libata relies on the dock driver to call _EJ0 on
its behalf.

> > This fix is not perfect. If we receive an eject notification just after
> > resetting `ejecting` but before _EJ0, we will still invoke _EJ0 twice.
> > However this seems to be the best solution available, and it strictly
> > improves the current situation.
> > 
> > Another potential fix is to add an `ejected` flag to each device and not
> > invoke _EJ0 for already ejected devices. However, this approach risks
> > losing synchronization with the platform if something else goes wrong,
> > potentially preventing the device from being ejected permanently.  And we
> > need to check with bus driver to make sure the device is really ejected
> > successfully. But this check is still racy. So we cannot ensure no extra
> > _EJ0 invocations either.
> 
> I see the problem.  Thanks for the detailed explanation and details
> about reproducing it and the trace.
> 
> I'm not sure whether the platform should reissue the Bus Check
> notification based on the fact that the OS hasn't invoked _EJ0 in some
> arbitrary time.  That seems a little bit presumptuous because, for
> example, the platform can't know how long it will take to write out
> the dirty page cache.  The racyness of the workaround seems
> troublesome to me.
> 
> But this is all really an ACPI issue, not a PCI issue, so I'd like to
> defer to the ACPI experts here.
> 
> > Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
> > ---
> >  drivers/acpi/osl.c                 |  6 ++++++
> >  drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
> >  include/acpi/acpi_bus.h            |  4 +++-
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> > 
> > We observed that umount can take extremely long time if there is a lot of
> > dirty page cache.  umount will take the s_umount semaphore, which will
> > block the ejecting process:
> > 
> > 	__schedule+0x1e0/0x630
> > 	? kernfs_put.part.0+0xd4/0x1a0
> > 	schedule+0x46/0xb0
> > 	rwsem_down_read_slowpath+0x16b/0x490
> > 	__get_super.part.0+0xc1/0xe0
> > 	fsync_bdev+0x11/0x60
> > 	invalidate_partition+0x5c/0xa0
> > 	del_gendisk+0x103/0x2e0
> > 	virtblk_remove+0x27/0xa0
> > 	virtio_dev_remove+0x36/0x90
> > 	__device_release_driver+0x172/0x260
> > 	device_release_driver+0x24/0x30
> > 	bus_remove_device+0xf6/0x170
> > 	device_del+0x19b/0x450
> > 	device_unregister+0x16/0x60
> > 	unregister_virtio_device+0x11/0x20
> > 	virtio_pci_remove+0x31/0x60
> > 	pci_device_remove+0x38/0xa0
> > 	__device_release_driver+0x172/0x260
> > 	device_release_driver+0x24/0x30
> > 	pci_stop_bus_device+0x6c/0x90
> > 	pci_stop_and_remove_bus_device+0xe/0x20
> > 	disable_slot+0x49/0x90
> > 	acpiphp_disable_and_eject_slot+0x15/0x90
> > 
> > While OS is not invoking _EJ0 timely, the user (or hypervisor) may retry by
> > issuing another notification, which will be queued in kacpi_hotplug_wq.
> > After the umount is finally done, the _EJ0 will be invoked.  Then, if there
> > are devices pending attach, the hypervisor may choose to attach it
> > immediately to the same slot.  The new device can be ejected by the queued
> > ejecting process unintentionally.
> > 
> > On Alibaba Cloud, we re-issue the notification around every 10s if the OS
> > does not respond.  (BTW, do you think platform is allowed to re-issue
> > the notification on timeout?)
> > We can easily reproduce this issue on Alibaba Cloud ECS:
> > 
> > 	WRITE_SIZE=2300M  # tune this value so that the umount takes 20s
> > 	# replace these disk serial numbers
> > 	DISK_DETACH=bp142xxxxxxxxxxxxxxx  # pre-formatted
> > 	DISK_ATTACH=bp109xxxxxxxxxxxxxxx  # any
> > 	# start from these two disks detached
> > 
> > 	INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> > 	echo "instance id: $INSTANCE_ID"
> > 	DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> > 
> > 	echo "attaching disk d-$DISK_DETACH"
> > 	aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > 
> > 	sleep 2
> > 	mkdir -p /mnt/slow
> > 	mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> > 	echo "mounted d-$DISK_DETACH to /mnt/slow"
> > 
> > 	rm -f /mnt/slow/zero
> > 	echo "populating dirty cache"
> > 	head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;
> > 
> > 	echo umounting
> > 	(
> > 		umount /mnt/slow
> > 		echo umounted
> > 	)&
> > 
> > 	sleep 2
> > 	echo "detaching disk d-$DISK_DETACH"
> > 	aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > 
> > 	sleep 10
> > 	echo "attaching disk d-$DISK_ATTACH"
> > 	aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> > 
> > 	sleep 7
> > 	wait
> > 	for _ in {1..10}; do
> > 		sleep 1
> > 		if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> > 			echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> > 			exit 0
> > 		fi
> > 		echo "disk d-$DISK_ATTACH not found yet"
> > 	done
> > 
> > 	echo "issue reproduced"
> > 	exit 1
> > 
> > And here is the trace we got from `perf trace` while running the above script on an unpatched kernel:
> > 
> > 	[starting detach]
> > 	 48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 	 48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > 	 48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > 	[blocked, retrying detach]
> > 	 58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 	 58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > 	[detach done]
> > 	 62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > 	[another device attaching]
> > 	 62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> > 	 62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> > 	 63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > 	[the new device is ejected unintentionally]
> > 	 63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > 	[the actual attach task, scanned the bus but got nothing]
> > 	 63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> > 
> > With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
> > suppress the acpi_evaluate_ej0 at 63042.520.
> > 
> > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > index 5ff343096ece..f041c4db10f7 100644
> > --- a/drivers/acpi/osl.c
> > +++ b/drivers/acpi/osl.c
> > @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
> >  {
> >  	struct acpi_hp_work *hpw;
> >  
> > +	if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> > +			&& atomic_xchg(&adev->hp->ejecting, 1)) {
> > +		put_device(&adev->dev);
> > +		return AE_OK;
> > +	}
> > +
> >  	acpi_handle_debug(adev->handle,
> >  			  "Scheduling hotplug event %u for deferred handling\n",
> >  			   src);
> > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > index 5b1f271c6034..3c50f2af1584 100644
> > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
> >  	context->hp.notify = acpiphp_hotplug_notify;
> >  	context->hp.fixup = acpiphp_post_dock_fixup;
> >  	acpi_set_hp_context(adev, &context->hp);
> > +	adev->flags.should_dedup_eject = true;
> >  	return context;
> >  }
> >  
> > @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
> >  	}
> >  }
> >  
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
> >  
> >  static void hotplug_event(u32 type, struct acpiphp_context *context)
> >  {
> > @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
> >  	case ACPI_NOTIFY_EJECT_REQUEST:
> >  		/* request device eject */
> >  		acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> > -		acpiphp_disable_and_eject_slot(slot);
> > +		acpiphp_disable_and_eject_slot(&context->hp, slot);
> >  		break;
> >  	}
> >  
> > @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
> >  
> >  /**
> >   * acpiphp_disable_and_eject_slot - power off and eject slot
> > + * @hp: the context that received eject notification, can be NULL
> >   * @slot: ACPI PHP slot
> >   */
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
> >  {
> >  	struct acpiphp_func *func;
> >  
> > @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> >  	/* unconfigure all functions */
> >  	disable_slot(slot);
> >  
> > +	if (hp)
> > +		atomic_set(&hp->ejecting, 0);
> > +
> >  	list_for_each_entry(func, &slot->funcs, sibling)
> >  		if (func->flags & FUNC_HAS_EJ0) {
> >  			acpi_handle handle = func_to_handle(func);
> > @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
> >  	 */
> >  	acpi_scan_lock_acquire();
> >  	pci_lock_rescan_remove();
> > -	ret = acpiphp_disable_and_eject_slot(slot);
> > +	ret = acpiphp_disable_and_eject_slot(NULL, slot);
> >  	pci_unlock_rescan_remove();
> >  	acpi_scan_lock_release();
> >  	return ret;
> > diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> > index aad1a95e6863..870c1ffe47c9 100644
> > --- a/include/acpi/acpi_bus.h
> > +++ b/include/acpi/acpi_bus.h
> > @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
> >  
> >  struct acpi_hotplug_context {
> >  	struct acpi_device *self;
> > +	atomic_t ejecting;
> >  	acpi_hp_notify notify;
> >  	acpi_hp_uevent uevent;
> >  	acpi_hp_fixup fixup;
> > @@ -215,7 +216,8 @@ struct acpi_device_flags {
> >  	u32 cca_seen:1;
> >  	u32 enumeration_by_parent:1;
> >  	u32 honor_deps:1;
> > -	u32 reserved:18;
> > +	u32 should_dedup_eject:1;
> > +	u32 reserved:17;
> >  };
> >  
> >  /* File System */
> > -- 
> > 2.48.1
> > 


* Re: [PATCH] ACPI/PCI: deduplicate concurrent eject notify
  2025-04-11  9:12   ` Weiwen Hu
@ 2025-04-11 10:39     ` Rafael J. Wysocki
  0 siblings, 0 replies; 4+ messages in thread
From: Rafael J. Wysocki @ 2025-04-11 10:39 UTC (permalink / raw)
  To: Weiwen Hu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Bjorn Helgaas, linux-acpi,
	linux-pci

On Fri, Apr 11, 2025 at 11:12 AM Weiwen Hu <huweiwen@linux.alibaba.com> wrote:
>
> Hi Rafael, what's your opinion on this race condition in ACPI hotplug?

In the first place, the platform firmware should wait for an _OST
confirmation of an eject or failure.

I cannot spend more time on this today, though, I'll get to it next week.

> On Fri, Feb 28, 2025 at 11:16:01AM -0600, Bjorn Helgaas wrote:
> > On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> > > Ignore the eject notification when the previous ejection is still in
> > > progress to prevent multiple _EJ0 invocations when the ejection completes.
> > >
> > > The first _EJ0 tells the platform to actually eject the device and frees
> > > the slot for other devices, so a subsequent _EJ0 may accidentally eject
> > > a new device that has just been plugged into the slot. We need to avoid this.
> > >
> > > For each acpi_device, this patch introduces a new field `ejecting`, which
> > > is set before enqueuing work on `kacpi_hotplug_wq` and reset before invoking
> > > _EJ0.  Every notification received before invoking _EJ0 is targeted at
> > > the old device, so we don't miss any notifications for a newly
> > > plugged device.  A new flag, `should_dedup_eject`, allows each driver to
> > > adopt this behavior gradually; a driver should set the flag at
> > > initialization if it resets `ejecting`.
> >
> > Which drivers do you have in mind when you say "each driver can
> > implement this feature gradually"?  You set should_dedup_eject in
> > acpiphp, so I guess you mean other ACPI hotplug drivers?  I see
> > acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
> > those are the only other current ones?
>
> Yes, I mean other ACPI hotplug drivers. I searched for acpi_evaluate_ej0(),
> since `ejecting` should be reset just before it is called, and found dock.c
> and scan.c. Both are reached from acpi_device_hotplug(), so maybe I can change
> all of them at once.  Should we also consider out-of-tree drivers?
>
> As for surface3-wmi.c, its s3_wmi_hp_notify function seems to only probe the
> lid state and does not actually support hot unplug.
>
> But ata_acpi_detach_device() does not seem to call acpi_evaluate_ej0().  From
> f730ae183863 ("libata: remove functions now handed by ACPI dock driver"), it
> seems that libata relies on the dock driver to call _EJ0 for it.
>
> > > This fix is not perfect. If we receive an eject notification just after
> > > resetting `ejecting` but before _EJ0, we will still invoke _EJ0 twice.
> > > However, this seems to be the best solution available, and it strictly
> > > improves on the current situation.
> > >
> > > Another potential fix is to add an `ejected` flag to each device and not
> > > invoke _EJ0 for already ejected devices. However, this approach risks
> > > losing synchronization with the platform if something else goes wrong,
> > > potentially preventing the device from ever being ejected.  We would also
> > > need to check with the bus driver that the device was really ejected
> > > successfully, but that check is itself racy, so extra _EJ0 invocations
> > > cannot be ruled out either.
> >
> > I see the problem.  Thanks for the detailed explanation and details
> > about reproducing it and the trace.
> >
> > I'm not sure whether the platform should reissue the Bus Check
> > notification based on the fact that the OS hasn't invoked _EJ0 within some
> > arbitrary time.  That seems a little presumptuous because, for
> > example, the platform can't know how long it will take to write out
> > the dirty page cache.  The raciness of the workaround seems
> > troublesome to me.
> >
> > But this is all really an ACPI issue, not a PCI issue, so I'd like to
> > defer to the ACPI experts here.
> >
> > > Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
> > > ---
> > >  drivers/acpi/osl.c                 |  6 ++++++
> > >  drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
> > >  include/acpi/acpi_bus.h            |  4 +++-
> > >  3 files changed, 20 insertions(+), 5 deletions(-)
> > >
> > > We observed that umount can take an extremely long time if there is a lot
> > > of dirty page cache.  umount takes the s_umount semaphore, which blocks
> > > the ejecting process:
> > >
> > >     __schedule+0x1e0/0x630
> > >     ? kernfs_put.part.0+0xd4/0x1a0
> > >     schedule+0x46/0xb0
> > >     rwsem_down_read_slowpath+0x16b/0x490
> > >     __get_super.part.0+0xc1/0xe0
> > >     fsync_bdev+0x11/0x60
> > >     invalidate_partition+0x5c/0xa0
> > >     del_gendisk+0x103/0x2e0
> > >     virtblk_remove+0x27/0xa0
> > >     virtio_dev_remove+0x36/0x90
> > >     __device_release_driver+0x172/0x260
> > >     device_release_driver+0x24/0x30
> > >     bus_remove_device+0xf6/0x170
> > >     device_del+0x19b/0x450
> > >     device_unregister+0x16/0x60
> > >     unregister_virtio_device+0x11/0x20
> > >     virtio_pci_remove+0x31/0x60
> > >     pci_device_remove+0x38/0xa0
> > >     __device_release_driver+0x172/0x260
> > >     device_release_driver+0x24/0x30
> > >     pci_stop_bus_device+0x6c/0x90
> > >     pci_stop_and_remove_bus_device+0xe/0x20
> > >     disable_slot+0x49/0x90
> > >     acpiphp_disable_and_eject_slot+0x15/0x90
> > >
> > > While the OS has not yet invoked _EJ0, the user (or hypervisor) may retry
> > > by issuing another notification, which will be queued in kacpi_hotplug_wq.
> > > After the umount finally completes, _EJ0 will be invoked.  Then, if there
> > > is a device pending attach, the hypervisor may choose to attach it
> > > immediately to the same slot, and the new device can be unintentionally
> > > ejected by the queued eject work.
> > >
> > > On Alibaba Cloud, we re-issue the notification around every 10s if the OS
> > > does not respond.  (BTW, do you think the platform is allowed to re-issue
> > > the notification on a timeout?)
> > > We can easily reproduce this issue on Alibaba Cloud ECS:
> > >
> > >     WRITE_SIZE=2300M  # tune this value so that the umount takes 20s
> > >     # replace these disk serial numbers
> > >     DISK_DETACH=bp142xxxxxxxxxxxxxxx  # pre-formatted
> > >     DISK_ATTACH=bp109xxxxxxxxxxxxxxx  # any
> > >     # start from these two disks detached
> > >
> > >     INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> > >     echo "instance id: $INSTANCE_ID"
> > >     DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> > >
> > >     echo "attaching disk d-$DISK_DETACH"
> > >     aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 2
> > >     mkdir -p /mnt/slow
> > >     mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> > >     echo "mounted d-$DISK_DETACH to /mnt/slow"
> > >
> > >     rm -f /mnt/slow/zero
> > >     echo "populating dirty cache"
> > >     head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;
> > >
> > >     echo umounting
> > >     (
> > >             umount /mnt/slow
> > >             echo umounted
> > >     )&
> > >
> > >     sleep 2
> > >     echo "detaching disk d-$DISK_DETACH"
> > >     aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 10
> > >     echo "attaching disk d-$DISK_ATTACH"
> > >     aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 7
> > >     wait
> > >     for _ in {1..10}; do
> > >             sleep 1
> > >             if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> > >                     echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> > >                     exit 0
> > >             fi
> > >             echo "disk d-$DISK_ATTACH not found yet"
> > >     done
> > >
> > >     echo "issue reproduced"
> > >     exit 1
> > >
> > > And here is the trace we got from `perf trace` while running the above script on an unpatched kernel:
> > >
> > >     [starting detach]
> > >      48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > >      48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > >      48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > >     [blocked, retrying detach]
> > >      58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > >      58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > >     [detach done]
> > >      62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > >     [another device attaching]
> > >      62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> > >      62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> > >      63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > >     [the new device is ejected unintentionally]
> > >      63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > >     [the actual attach task, scanned the bus but got nothing]
> > >      63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> > >
> > > With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
> > > suppress the acpi_evaluate_ej0 at 63042.520.
> > >
> > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > index 5ff343096ece..f041c4db10f7 100644
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
> > >  {
> > >     struct acpi_hp_work *hpw;
> > >
> > > +   if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> > > +                   && atomic_xchg(&adev->hp->ejecting, 1)) {
> > > +           put_device(&adev->dev);
> > > +           return AE_OK;
> > > +   }
> > > +
> > >     acpi_handle_debug(adev->handle,
> > >                       "Scheduling hotplug event %u for deferred handling\n",
> > >                        src);
> > > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > > index 5b1f271c6034..3c50f2af1584 100644
> > > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > > @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
> > >     context->hp.notify = acpiphp_hotplug_notify;
> > >     context->hp.fixup = acpiphp_post_dock_fixup;
> > >     acpi_set_hp_context(adev, &context->hp);
> > > +   adev->flags.should_dedup_eject = true;
> > >     return context;
> > >  }
> > >
> > > @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
> > >     }
> > >  }
> > >
> > > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> > > +static int
> > > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
> > >
> > >  static void hotplug_event(u32 type, struct acpiphp_context *context)
> > >  {
> > > @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
> > >     case ACPI_NOTIFY_EJECT_REQUEST:
> > >             /* request device eject */
> > >             acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> > > -           acpiphp_disable_and_eject_slot(slot);
> > > +           acpiphp_disable_and_eject_slot(&context->hp, slot);
> > >             break;
> > >     }
> > >
> > > @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
> > >
> > >  /**
> > >   * acpiphp_disable_and_eject_slot - power off and eject slot
> > > + * @hp: the context that received eject notification, can be NULL
> > >   * @slot: ACPI PHP slot
> > >   */
> > > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > > +static int
> > > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
> > >  {
> > >     struct acpiphp_func *func;
> > >
> > > @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > >     /* unconfigure all functions */
> > >     disable_slot(slot);
> > >
> > > +   if (hp)
> > > +           atomic_set(&hp->ejecting, 0);
> > > +
> > >     list_for_each_entry(func, &slot->funcs, sibling)
> > >             if (func->flags & FUNC_HAS_EJ0) {
> > >                     acpi_handle handle = func_to_handle(func);
> > > @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
> > >      */
> > >     acpi_scan_lock_acquire();
> > >     pci_lock_rescan_remove();
> > > -   ret = acpiphp_disable_and_eject_slot(slot);
> > > +   ret = acpiphp_disable_and_eject_slot(NULL, slot);
> > >     pci_unlock_rescan_remove();
> > >     acpi_scan_lock_release();
> > >     return ret;
> > > diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> > > index aad1a95e6863..870c1ffe47c9 100644
> > > --- a/include/acpi/acpi_bus.h
> > > +++ b/include/acpi/acpi_bus.h
> > > @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
> > >
> > >  struct acpi_hotplug_context {
> > >     struct acpi_device *self;
> > > +   atomic_t ejecting;
> > >     acpi_hp_notify notify;
> > >     acpi_hp_uevent uevent;
> > >     acpi_hp_fixup fixup;
> > > @@ -215,7 +216,8 @@ struct acpi_device_flags {
> > >     u32 cca_seen:1;
> > >     u32 enumeration_by_parent:1;
> > >     u32 honor_deps:1;
> > > -   u32 reserved:18;
> > > +   u32 should_dedup_eject:1;
> > > +   u32 reserved:17;
> > >  };
> > >
> > >  /* File System */
> > > --
> > > 2.48.1
> > >
>


end of thread, other threads:[~2025-04-11 10:39 UTC | newest]

Thread overview: 4+ messages:
2025-02-24  7:00 [PATCH] ACPI/PCI: deduplicate concurrent eject notify Weiwen Hu
2025-02-28 17:16 ` Bjorn Helgaas
2025-04-11  9:12   ` Weiwen Hu
2025-04-11 10:39     ` Rafael J. Wysocki
