From: Weiwen Hu <huweiwen@linux.alibaba.com>
To: Bjorn Helgaas <helgaas@kernel.org>,
"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
Bjorn Helgaas <bhelgaas@google.com>,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH] ACPI/PCI: deduplicate concurrent eject notify
Date: Fri, 11 Apr 2025 17:12:07 +0800 [thread overview]
Message-ID: <Z_i3-z4dODvYjaPN@MacBook-Pro.local> (raw)
In-Reply-To: <20250228171601.GA23123@bhelgaas>
Hi Rafael, what's your opinion on this race condition in ACPI hotplug?
On Fri, Feb 28, 2025 at 11:16:01AM -0600, Bjorn Helgaas wrote:
> On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> > Ignore the eject notification when the previous ejection is still in
> > progress to prevent multiple _EJ0 invocations when the ejection completes.
> >
> > The first _EJ0 informs the platform to actually eject the device and frees
> > the slot for other devices. So the subsequent _EJ0 may accidentally eject
> > the new device that is just plugged into this slot. We need to avoid this.
> >
> > For each acpi_device, this patch introduces a new field `ejecting`, which
> > is set before enqueuing the `kacpi_hotplug_wq` and reset before invoking
> > _EJ0. Every notifications we received before invoking _EJ0 is targeted to
> > the old device. This ensures we don't miss any notifications for newly
> > plugged device. And a new flag `should_dedup_eject` allows each driver to
> > implement this feature gradually. A driver should set this flag on
> > initialization if it will reset `ejecting`.
>
> Which drivers do you have in mind when you say "each driver can
> implement this feature gradually"? You set should_dedup_eject in
> acpiphp, so I guess you mean other ACPI hotplug drivers? I see
> acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
> those are the only other current ones?
Yes, I mean other ACPI hotplug drivers. I've searched for acpi_evaluate_ej0(),
because `ejecting` should be reset just before this. I got dock.c and scan.c.
All of them are called from acpi_device_hotplug(). So may be I can change all
of them at once. Should we also consider out-of-tree drivers?
As for surface3-wmi.c, its s3_wmi_hp_notify function seems only probing lid
state, and does not support hot unplug actually.
But ata_acpi_detach_device() seems not calling acpi_evaluate_ej0(), From
f730ae183863 ("libata: remove functions now handed by ACPI dock driver"), It
seems that ata relies on dock to call ej0 for it.
> > This fix is not perfect. If we receive an eject notification just after
> > resetting `ejecting` but before _EJ0, we will still invoke _EJ0 twice.
> > However this seems to be the best solution available, and it strictly
> > improves the current situation.
> >
> > Another potential fix is to add an `ejected` flag to each device and not
> > invoke _EJ0 for already ejected devices. However, this approach risks
> > losing synchronization with the platform if something else goes wrong,
> > potentially preventing the device from being ejected permanently. And we
> > need to check with bus driver to make sure the device is really ejected
> > successfully. But this check is still racy. So we cannot ensure no extra
> > _EJ0 invocations either.
>
> I see the problem. Thanks for the detailed explanation and details
> about reproducing it and the trace.
>
> I'm not sure whether the platform should reissue the Bus Check
> notification based on the fact that the OS hasn't invoked _EJ0 in some
> arbitrary time. That seems a little bit presumptuous because, for
> example, the platform can't know how long it will take to write out
> the dirty page cache. The racyness of the workaround seems
> troublesome to me.
>
> But this is all really an ACPI issue, not a PCI issue, so I'd like to
> defer to the ACPI experts here.
>
> > Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
> > ---
> > drivers/acpi/osl.c | 6 ++++++
> > drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
> > include/acpi/acpi_bus.h | 4 +++-
> > 3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > We observed that umount can take extremely long time if there is a lot of
> > dirty page cache. umount will take the s_umount semaphore, which will
> > block the ejecting process:
> >
> > __schedule+0x1e0/0x630
> > ? kernfs_put.part.0+0xd4/0x1a0
> > schedule+0x46/0xb0
> > rwsem_down_read_slowpath+0x16b/0x490
> > __get_super.part.0+0xc1/0xe0
> > fsync_bdev+0x11/0x60
> > invalidate_partition+0x5c/0xa0
> > del_gendisk+0x103/0x2e0
> > virtblk_remove+0x27/0xa0
> > virtio_dev_remove+0x36/0x90
> > __device_release_driver+0x172/0x260
> > device_release_driver+0x24/0x30
> > bus_remove_device+0xf6/0x170
> > device_del+0x19b/0x450
> > device_unregister+0x16/0x60
> > unregister_virtio_device+0x11/0x20
> > virtio_pci_remove+0x31/0x60
> > pci_device_remove+0x38/0xa0
> > __device_release_driver+0x172/0x260
> > device_release_driver+0x24/0x30
> > pci_stop_bus_device+0x6c/0x90
> > pci_stop_and_remove_bus_device+0xe/0x20
> > disable_slot+0x49/0x90
> > acpiphp_disable_and_eject_slot+0x15/0x90
> >
> > While OS is not invoking _EJ0 timely, the user (or hypervisor) may retry by
> > issuing another notification, which will be queued in kacpi_hotplug_wq.
> > After the umount is finally done, the _EJ0 will be invoked. Then, if there
> > are devices pending attach, the hypervisor may choose to attach it
> > immediately to the same slot. The new device can be ejected by the queued
> > ejecting process unintentionally.
> >
> > On Alibaba Cloud, we re-issue the notification around every 10s if the OS
> > does not respond. (BTW, do you think platform is allowed to re-issue
> > the notification on timeout?)
> > We can easily reproduce this issue on Alibaba Cloud ECS:
> >
> > WRITE_SIZE=2300M # tune this value so that the umount takes 20s
> > # replace these disk serial numbers
> > DISK_DETACH=bp142xxxxxxxxxxxxxxx # pre-formatted
> > DISK_ATTACH=bp109xxxxxxxxxxxxxxx # any
> > # start from these two disks detached
> >
> > INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> > echo "instance id: $INSTANCE_ID"
> > DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> >
> > echo "attaching disk d-$DISK_DETACH"
> > aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 2
> > mkdir -p /mnt/slow
> > mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> > echo "mounted d-$DISK_DETACH to /mnt/slow"
> >
> > rm -f /mnt/slow/zero
> > echo "populating dirty cache"
> > head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;
> >
> > echo umounting
> > (
> > umount /mnt/slow
> > echo umounted
> > )&
> >
> > sleep 2
> > echo "detaching disk d-$DISK_DETACH"
> > aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 10
> > echo "attaching disk d-$DISK_ATTACH"
> > aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 7
> > wait
> > for _ in {1..10}; do
> > sleep 1
> > if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> > echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> > exit 0
> > fi
> > echo "disk d-$DISK_ATTACH not found yet"
> > done
> >
> > echo "issue reproduced"
> > exit 1
> >
> > And here is the trace we got from `perf trace` while running the above script on an unpatched kernel:
> >
> > [starting detach]
> > 48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > 48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > [blocked, retrying detach]
> > 58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > [detach done]
> > 62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > [another device attaching]
> > 62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> > 62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> > 63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > [the new device is ejected unintentionally]
> > 63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > [the actual attach task, scanned the bus but got nothing]
> > 63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> >
> > With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
> > suppress the acpi_evaluate_ej0 at 63042.520.
> >
> > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > index 5ff343096ece..f041c4db10f7 100644
> > --- a/drivers/acpi/osl.c
> > +++ b/drivers/acpi/osl.c
> > @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
> > {
> > struct acpi_hp_work *hpw;
> >
> > + if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> > + && atomic_xchg(&adev->hp->ejecting, 1)) {
> > + put_device(&adev->dev);
> > + return AE_OK;
> > + }
> > +
> > acpi_handle_debug(adev->handle,
> > "Scheduling hotplug event %u for deferred handling\n",
> > src);
> > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > index 5b1f271c6034..3c50f2af1584 100644
> > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
> > context->hp.notify = acpiphp_hotplug_notify;
> > context->hp.fixup = acpiphp_post_dock_fixup;
> > acpi_set_hp_context(adev, &context->hp);
> > + adev->flags.should_dedup_eject = true;
> > return context;
> > }
> >
> > @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
> > }
> > }
> >
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
> >
> > static void hotplug_event(u32 type, struct acpiphp_context *context)
> > {
> > @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
> > case ACPI_NOTIFY_EJECT_REQUEST:
> > /* request device eject */
> > acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> > - acpiphp_disable_and_eject_slot(slot);
> > + acpiphp_disable_and_eject_slot(&context->hp, slot);
> > break;
> > }
> >
> > @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
> >
> > /**
> > * acpiphp_disable_and_eject_slot - power off and eject slot
> > + * @hp: the context that received eject notification, can be NULL
> > * @slot: ACPI PHP slot
> > */
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
> > {
> > struct acpiphp_func *func;
> >
> > @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > /* unconfigure all functions */
> > disable_slot(slot);
> >
> > + if (hp)
> > + atomic_set(&hp->ejecting, 0);
> > +
> > list_for_each_entry(func, &slot->funcs, sibling)
> > if (func->flags & FUNC_HAS_EJ0) {
> > acpi_handle handle = func_to_handle(func);
> > @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
> > */
> > acpi_scan_lock_acquire();
> > pci_lock_rescan_remove();
> > - ret = acpiphp_disable_and_eject_slot(slot);
> > + ret = acpiphp_disable_and_eject_slot(NULL, slot);
> > pci_unlock_rescan_remove();
> > acpi_scan_lock_release();
> > return ret;
> > diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> > index aad1a95e6863..870c1ffe47c9 100644
> > --- a/include/acpi/acpi_bus.h
> > +++ b/include/acpi/acpi_bus.h
> > @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
> >
> > struct acpi_hotplug_context {
> > struct acpi_device *self;
> > + atomic_t ejecting;
> > acpi_hp_notify notify;
> > acpi_hp_uevent uevent;
> > acpi_hp_fixup fixup;
> > @@ -215,7 +216,8 @@ struct acpi_device_flags {
> > u32 cca_seen:1;
> > u32 enumeration_by_parent:1;
> > u32 honor_deps:1;
> > - u32 reserved:18;
> > + u32 should_dedup_eject:1;
> > + u32 reserved:17;
> > };
> >
> > /* File System */
> > --
> > 2.48.1
> >
next prev parent reply other threads:[~2025-04-11 9:12 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-24 7:00 [PATCH] ACPI/PCI: deduplicate concurrent eject notify Weiwen Hu
2025-02-28 17:16 ` Bjorn Helgaas
2025-04-11 9:12 ` Weiwen Hu [this message]
2025-04-11 10:39 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z_i3-z4dODvYjaPN@MacBook-Pro.local \
--to=huweiwen@linux.alibaba.com \
--cc=bhelgaas@google.com \
--cc=helgaas@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=rafael.j.wysocki@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox