qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* an issue for device hot-unplug
@ 2023-04-03 13:24 Yu Zhang
  2023-04-03 16:32 ` Laurent Vivier
  2023-04-04 12:17 ` Igor Mammedov
  0 siblings, 2 replies; 9+ messages in thread
From: Yu Zhang @ 2023-04-03 13:24 UTC (permalink / raw)
  To: Laurent Vivier, qemu-devel, Jinpu Wang, Elmar Gerdes

[-- Attachment #1: Type: text/plain, Size: 2412 bytes --]

Dear Laurent,

recently we run into an issue with the following error:

command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" }
}' for VM "id" failed ({ "return": {"class": "GenericError", "desc":
"Device virtio-diskX is already in the process of unplug"} }).

The issue is reproducible. With a few seconds delay before hot-unplug,
hot-unplug just works fine.

After a few digging, we found that the commit 9323f892b39 may incur the
issue.
------------------
    failover: fix unplug pending detection

    Failover needs to detect the end of the PCI unplug to start migration
    after the VFIO card has been unplugged.

    To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and
reset in
    pcie_unplug_device().

    But since
        17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on
Q35")
    we have switched to ACPI unplug and these functions are not called
anymore
    and the flag not set. So failover migration is not able to detect if
card
    is really unplugged and acts as it's done as soon as it's started. So it
    doesn't wait the end of the unplug to start the migration. We don't see
any
    problem when we test that because ACPI unplug is faster than PCIe native
    hotplug and when the migration really starts the unplug operation is
    already done.

    See c000a9bd06ea ("pci: mark device having guest unplug request
pending")
        a99c4da9fc2a ("pci: mark devices partially unplugged")

    Signed-off-by: Laurent Vivier <lvivier@redhat.com>
    Reviewed-by: Ani Sinha <ani@anisinha.ca>
    Message-Id: <20211118133225.324937-4-lvivier@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
------------------
The purpose is for detecting the end of the PCI device hot-unplug. However,
we feel the error confusing. How is it possible that a disk "is already in
the process of unplug" during the first hot-unplug attempt? So far as I
know, the issue was also encountered by libvirt, but they simply ignored it:

    https://bugzilla.redhat.com/show_bug.cgi?id=1878659

Hence, a question is: should we have the line below in
acpi_pcihp_device_unplug_request_cb()?

   pdev->qdev.pending_deleted_event = true;

It would be great if you as the author could give us a few hints.

Thank you very much for your reply!

Sincerely,

Yu Zhang @ Compute Platform IONOS
03.04.2013

[-- Attachment #2: Type: text/html, Size: 3180 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-03 13:24 an issue for device hot-unplug Yu Zhang
@ 2023-04-03 16:32 ` Laurent Vivier
  2023-04-03 16:59   ` Yu Zhang
  2023-04-04 12:17 ` Igor Mammedov
  1 sibling, 1 reply; 9+ messages in thread
From: Laurent Vivier @ 2023-04-03 16:32 UTC (permalink / raw)
  To: Yu Zhang, qemu-devel, Jinpu Wang, Elmar Gerdes

Hi Yu,

please open a bug in the bug tracker:

https://gitlab.com/qemu/qemu/-/issues

It's easier to track the problem.

What is the version of QEMU you are using?
Could you provide QEMU command line?

Thanks,
Laurent


On 4/3/23 15:24, Yu Zhang wrote:
> Dear Laurent,
> 
> recently we run into an issue with the following error:
> 
> command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" } }' for VM "id" 
> failed ({ "return": {"class": "GenericError", "desc": "Device virtio-diskX is already in 
> the process of unplug"} }).
> 
> The issue is reproducible. With a few seconds delay before hot-unplug, hot-unplug just 
> works fine.
> 
> After a few digging, we found that the commit 9323f892b39 may incur the issue.
> ------------------
>      failover: fix unplug pending detection
> 
>      Failover needs to detect the end of the PCI unplug to start migration
>      after the VFIO card has been unplugged.
> 
>      To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
>      pcie_unplug_device().
> 
>      But since
>          17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
>      we have switched to ACPI unplug and these functions are not called anymore
>      and the flag not set. So failover migration is not able to detect if card
>      is really unplugged and acts as it's done as soon as it's started. So it
>      doesn't wait the end of the unplug to start the migration. We don't see any
>      problem when we test that because ACPI unplug is faster than PCIe native
>      hotplug and when the migration really starts the unplug operation is
>      already done.
> 
>      See c000a9bd06ea ("pci: mark device having guest unplug request pending")
>          a99c4da9fc2a ("pci: mark devices partially unplugged")
> 
>      Signed-off-by: Laurent Vivier <lvivier@redhat.com <mailto:lvivier@redhat.com>>
>      Reviewed-by: Ani Sinha <ani@anisinha.ca <mailto:ani@anisinha.ca>>
>      Message-Id: <20211118133225.324937-4-lvivier@redhat.com 
> <mailto:20211118133225.324937-4-lvivier@redhat.com>>
>      Reviewed-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
>      Signed-off-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
> ------------------
> The purpose is for detecting the end of the PCI device hot-unplug. However, we feel the 
> error confusing. How is it possible that a disk "is already in the process of unplug" 
> during the first hot-unplug attempt? So far as I know, the issue was also encountered by 
> libvirt, but they simply ignored it:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1878659 
> <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
> 
> Hence, a question is: should we have the line below in  acpi_pcihp_device_unplug_request_cb()?
> 
>     pdev->qdev.pending_deleted_event = true;
> 
> It would be great if you as the author could give us a few hints.
> 
> Thank you very much for your reply!
> 
> Sincerely,
> 
> Yu Zhang @ Compute Platform IONOS
> 03.04.2013



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-03 16:32 ` Laurent Vivier
@ 2023-04-03 16:59   ` Yu Zhang
  2023-04-04  6:45     ` Jinpu Wang
  2023-04-04 10:00     ` Jinpu Wang
  0 siblings, 2 replies; 9+ messages in thread
From: Yu Zhang @ 2023-04-03 16:59 UTC (permalink / raw)
  To: Laurent Vivier, qemu-devel, Elmar Gerdes, Jinpu Wang

[-- Attachment #1: Type: text/plain, Size: 3920 bytes --]

Dear Laurent,

Thank you for your quick reply. We used qemu-7.1, but it is reproducible
with qemu from v6.2 to the recent v8.0 release candidates.
I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2
and v6.2.0-rc3).

If it doesn't break anything else, it suffices to remove the line below
from acpi_pcihp_device_unplug_request_cb():

    pdev->qdev.pending_deleted_event = true;

but you may have a reason to keep it. First of all, I'll open a bug in the
bug tracker and let you know.

Best regards,
Yu Zhang

On Mon, Apr 3, 2023 at 6:32 PM Laurent Vivier <lvivier@redhat.com> wrote:

> Hi Yu,
>
> please open a bug in the bug tracker:
>
> https://gitlab.com/qemu/qemu/-/issues
>
> It's easier to track the problem.
>
> What is the version of QEMU you are using?
> Could you provide QEMU command line?
>
> Thanks,
> Laurent
>
>
> On 4/3/23 15:24, Yu Zhang wrote:
> > Dear Laurent,
> >
> > recently we run into an issue with the following error:
> >
> > command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX"
> } }' for VM "id"
> > failed ({ "return": {"class": "GenericError", "desc": "Device
> virtio-diskX is already in
> > the process of unplug"} }).
> >
> > The issue is reproducible. With a few seconds delay before hot-unplug,
> hot-unplug just
> > works fine.
> >
> > After a few digging, we found that the commit 9323f892b39 may incur the
> issue.
> > ------------------
> >      failover: fix unplug pending detection
> >
> >      Failover needs to detect the end of the PCI unplug to start
> migration
> >      after the VFIO card has been unplugged.
> >
> >      To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and
> reset in
> >      pcie_unplug_device().
> >
> >      But since
> >          17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default
> on Q35")
> >      we have switched to ACPI unplug and these functions are not called
> anymore
> >      and the flag not set. So failover migration is not able to detect
> if card
> >      is really unplugged and acts as it's done as soon as it's started.
> So it
> >      doesn't wait the end of the unplug to start the migration. We don't
> see any
> >      problem when we test that because ACPI unplug is faster than PCIe
> native
> >      hotplug and when the migration really starts the unplug operation is
> >      already done.
> >
> >      See c000a9bd06ea ("pci: mark device having guest unplug request
> pending")
> >          a99c4da9fc2a ("pci: mark devices partially unplugged")
> >
> >      Signed-off-by: Laurent Vivier <lvivier@redhat.com <mailto:
> lvivier@redhat.com>>
> >      Reviewed-by: Ani Sinha <ani@anisinha.ca <mailto:ani@anisinha.ca>>
> >      Message-Id: <20211118133225.324937-4-lvivier@redhat.com
> > <mailto:20211118133225.324937-4-lvivier@redhat.com>>
> >      Reviewed-by: Michael S. Tsirkin <mst@redhat.com <mailto:
> mst@redhat.com>>
> >      Signed-off-by: Michael S. Tsirkin <mst@redhat.com <mailto:
> mst@redhat.com>>
> > ------------------
> > The purpose is for detecting the end of the PCI device hot-unplug.
> However, we feel the
> > error confusing. How is it possible that a disk "is already in the
> process of unplug"
> > during the first hot-unplug attempt? So far as I know, the issue was
> also encountered by
> > libvirt, but they simply ignored it:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
> >
> > Hence, a question is: should we have the line below in
> acpi_pcihp_device_unplug_request_cb()?
> >
> >     pdev->qdev.pending_deleted_event = true;
> >
> > It would be great if you as the author could give us a few hints.
> >
> > Thank you very much for your reply!
> >
> > Sincerely,
> >
> > Yu Zhang @ Compute Platform IONOS
> > 03.04.2013
>
>

[-- Attachment #2: Type: text/html, Size: 5836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-03 16:59   ` Yu Zhang
@ 2023-04-04  6:45     ` Jinpu Wang
  2023-04-04 12:25       ` Igor Mammedov
  2023-04-04 10:00     ` Jinpu Wang
  1 sibling, 1 reply; 9+ messages in thread
From: Jinpu Wang @ 2023-04-04  6:45 UTC (permalink / raw)
  To: Yu Zhang; +Cc: Laurent Vivier, qemu-devel, Elmar Gerdes, imammedo

Hi Yu,

On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang <yu.zhang@ionos.com> wrote:
>
> Dear Laurent,
>
> Thank you for your quick reply. We used qemu-7.1, but it is reproducible with qemu from v6.2 to the recent v8.0 release candidates.
> I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2 and v6.2.0-rc3).
>
> If it doesn't break anything else, it suffices to remove the line below from acpi_pcihp_device_unplug_request_cb():
>
>     pdev->qdev.pending_deleted_event = true;
>
> but you may have a reason to keep it. First of all, I'll open a bug in the bug tracker and let you know.
>
> Best regards,
> Yu Zhang
This patch from Igor Mammedov seems relevant,
https://lore.kernel.org/qemu-devel/20230403131833-mutt-send-email-mst@kernel.org/T/#t
Can you try it out?

Regards!
Jinpu
>
> On Mon, Apr 3, 2023 at 6:32 PM Laurent Vivier <lvivier@redhat.com> wrote:
>>
>> Hi Yu,
>>
>> please open a bug in the bug tracker:
>>
>> https://gitlab.com/qemu/qemu/-/issues
>>
>> It's easier to track the problem.
>>
>> What is the version of QEMU you are using?
>> Could you provide QEMU command line?
>>
>> Thanks,
>> Laurent
>>
>>
>> On 4/3/23 15:24, Yu Zhang wrote:
>> > Dear Laurent,
>> >
>> > recently we run into an issue with the following error:
>> >
>> > command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" } }' for VM "id"
>> > failed ({ "return": {"class": "GenericError", "desc": "Device virtio-diskX is already in
>> > the process of unplug"} }).
>> >
>> > The issue is reproducible. With a few seconds delay before hot-unplug, hot-unplug just
>> > works fine.
>> >
>> > After a few digging, we found that the commit 9323f892b39 may incur the issue.
>> > ------------------
>> >      failover: fix unplug pending detection
>> >
>> >      Failover needs to detect the end of the PCI unplug to start migration
>> >      after the VFIO card has been unplugged.
>> >
>> >      To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
>> >      pcie_unplug_device().
>> >
>> >      But since
>> >          17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
>> >      we have switched to ACPI unplug and these functions are not called anymore
>> >      and the flag not set. So failover migration is not able to detect if card
>> >      is really unplugged and acts as it's done as soon as it's started. So it
>> >      doesn't wait the end of the unplug to start the migration. We don't see any
>> >      problem when we test that because ACPI unplug is faster than PCIe native
>> >      hotplug and when the migration really starts the unplug operation is
>> >      already done.
>> >
>> >      See c000a9bd06ea ("pci: mark device having guest unplug request pending")
>> >          a99c4da9fc2a ("pci: mark devices partially unplugged")
>> >
>> >      Signed-off-by: Laurent Vivier <lvivier@redhat.com <mailto:lvivier@redhat.com>>
>> >      Reviewed-by: Ani Sinha <ani@anisinha.ca <mailto:ani@anisinha.ca>>
>> >      Message-Id: <20211118133225.324937-4-lvivier@redhat.com
>> > <mailto:20211118133225.324937-4-lvivier@redhat.com>>
>> >      Reviewed-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
>> >      Signed-off-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
>> > ------------------
>> > The purpose is for detecting the end of the PCI device hot-unplug. However, we feel the
>> > error confusing. How is it possible that a disk "is already in the process of unplug"
>> > during the first hot-unplug attempt? So far as I know, the issue was also encountered by
>> > libvirt, but they simply ignored it:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
>> >
>> > Hence, a question is: should we have the line below in  acpi_pcihp_device_unplug_request_cb()?
>> >
>> >     pdev->qdev.pending_deleted_event = true;
>> >
>> > It would be great if you as the author could give us a few hints.
>> >
>> > Thank you very much for your reply!
>> >
>> > Sincerely,
>> >
>> > Yu Zhang @ Compute Platform IONOS
>> > 03.04.2013
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-03 16:59   ` Yu Zhang
  2023-04-04  6:45     ` Jinpu Wang
@ 2023-04-04 10:00     ` Jinpu Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Jinpu Wang @ 2023-04-04 10:00 UTC (permalink / raw)
  To: Yu Zhang, Laurent Vivier; +Cc: qemu-devel, Elmar Gerdes, imammedo

Hi Yu and Laurent,

On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang <yu.zhang@ionos.com> wrote:
>
> Dear Laurent,
>
> Thank you for your quick reply. We used qemu-7.1, but it is reproducible with qemu from v6.2 to the recent v8.0 release candidates.
> I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2 and v6.2.0-rc3).
>
> If it doesn't break anything else, it suffices to remove the line below from acpi_pcihp_device_unplug_request_cb():
>
>     pdev->qdev.pending_deleted_event = true;
>
> but you may have a reason to keep it. First of all, I'll open a bug in the bug tracker and let you know.
>
We opened an issue here:
https://gitlab.com/qemu-project/qemu/-/issues/1577

Regards!
Jinpu Wang
> Best regards,
> Yu Zhang
>
> On Mon, Apr 3, 2023 at 6:32 PM Laurent Vivier <lvivier@redhat.com> wrote:
>>
>> Hi Yu,
>>
>> please open a bug in the bug tracker:
>>
>> https://gitlab.com/qemu/qemu/-/issues
>>
>> It's easier to track the problem.
>>
>> What is the version of QEMU you are using?
>> Could you provide QEMU command line?
>>
>> Thanks,
>> Laurent
>>
>>
>> On 4/3/23 15:24, Yu Zhang wrote:
>> > Dear Laurent,
>> >
>> > recently we run into an issue with the following error:
>> >
>> > command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" } }' for VM "id"
>> > failed ({ "return": {"class": "GenericError", "desc": "Device virtio-diskX is already in
>> > the process of unplug"} }).
>> >
>> > The issue is reproducible. With a few seconds delay before hot-unplug, hot-unplug just
>> > works fine.
>> >
>> > After a few digging, we found that the commit 9323f892b39 may incur the issue.
>> > ------------------
>> >      failover: fix unplug pending detection
>> >
>> >      Failover needs to detect the end of the PCI unplug to start migration
>> >      after the VFIO card has been unplugged.
>> >
>> >      To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
>> >      pcie_unplug_device().
>> >
>> >      But since
>> >          17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
>> >      we have switched to ACPI unplug and these functions are not called anymore
>> >      and the flag not set. So failover migration is not able to detect if card
>> >      is really unplugged and acts as it's done as soon as it's started. So it
>> >      doesn't wait the end of the unplug to start the migration. We don't see any
>> >      problem when we test that because ACPI unplug is faster than PCIe native
>> >      hotplug and when the migration really starts the unplug operation is
>> >      already done.
>> >
>> >      See c000a9bd06ea ("pci: mark device having guest unplug request pending")
>> >          a99c4da9fc2a ("pci: mark devices partially unplugged")
>> >
>> >      Signed-off-by: Laurent Vivier <lvivier@redhat.com <mailto:lvivier@redhat.com>>
>> >      Reviewed-by: Ani Sinha <ani@anisinha.ca <mailto:ani@anisinha.ca>>
>> >      Message-Id: <20211118133225.324937-4-lvivier@redhat.com
>> > <mailto:20211118133225.324937-4-lvivier@redhat.com>>
>> >      Reviewed-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
>> >      Signed-off-by: Michael S. Tsirkin <mst@redhat.com <mailto:mst@redhat.com>>
>> > ------------------
>> > The purpose is for detecting the end of the PCI device hot-unplug. However, we feel the
>> > error confusing. How is it possible that a disk "is already in the process of unplug"
>> > during the first hot-unplug attempt? So far as I know, the issue was also encountered by
>> > libvirt, but they simply ignored it:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
>> >
>> > Hence, a question is: should we have the line below in  acpi_pcihp_device_unplug_request_cb()?
>> >
>> >     pdev->qdev.pending_deleted_event = true;
>> >
>> > It would be great if you as the author could give us a few hints.
>> >
>> > Thank you very much for your reply!
>> >
>> > Sincerely,
>> >
>> > Yu Zhang @ Compute Platform IONOS
>> > 03.04.2013
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-03 13:24 an issue for device hot-unplug Yu Zhang
  2023-04-03 16:32 ` Laurent Vivier
@ 2023-04-04 12:17 ` Igor Mammedov
  1 sibling, 0 replies; 9+ messages in thread
From: Igor Mammedov @ 2023-04-04 12:17 UTC (permalink / raw)
  To: Yu Zhang; +Cc: Laurent Vivier, qemu-devel, Jinpu Wang, Elmar Gerdes

On Mon, 3 Apr 2023 15:24:43 +0200
Yu Zhang <yu.zhang@ionos.com> wrote:

> Dear Laurent,
> 
> recently we run into an issue with the following error:
> 
> command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" }
> }' for VM "id" failed ({ "return": {"class": "GenericError", "desc":
> "Device virtio-diskX is already in the process of unplug"} }).
> 
> The issue is reproducible. With a few seconds delay before hot-unplug,
> hot-unplug just works fine.
> 
> After a few digging, we found that the commit 9323f892b39 may incur the
> issue.
> ------------------
>     failover: fix unplug pending detection
> 
>     Failover needs to detect the end of the PCI unplug to start migration
>     after the VFIO card has been unplugged.
> 
>     To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and
> reset in
>     pcie_unplug_device().
> 
>     But since
>         17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on
> Q35")
>     we have switched to ACPI unplug and these functions are not called
> anymore
>     and the flag not set. So failover migration is not able to detect if
> card
>     is really unplugged and acts as it's done as soon as it's started. So it
>     doesn't wait the end of the unplug to start the migration. We don't see
> any
>     problem when we test that because ACPI unplug is faster than PCIe native
>     hotplug and when the migration really starts the unplug operation is
>     already done.
> 
>     See c000a9bd06ea ("pci: mark device having guest unplug request
> pending")
>         a99c4da9fc2a ("pci: mark devices partially unplugged")
> 
>     Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>     Reviewed-by: Ani Sinha <ani@anisinha.ca>
>     Message-Id: <20211118133225.324937-4-lvivier@redhat.com>
>     Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ------------------
> The purpose is for detecting the end of the PCI device hot-unplug. However,

unplug is async process and issuing multiple unplug requests waiting for
'not found' error as a means to detect that device has been unplugged
hardly a sane way to do that.
Instead of swamping guest with unplug requests (which lead to hw interrupts)
you should wait for DEVICE_DELETED QMP event.

> we feel the error confusing. How is it possible that a disk "is already in
> the process of unplug" during the first hot-unplug attempt? So far as I
> know, the issue was also encountered by libvirt, but they simply ignored it:
> 
>     https://bugzilla.redhat.com/show_bug.cgi?id=1878659
> 
> Hence, a question is: should we have the line below in
> acpi_pcihp_device_unplug_request_cb()?
> 
>    pdev->qdev.pending_deleted_event = true;

comment 15 in above BZ describes how we could get rid of this line
but also see comment 17
(in nutshell you get error because device hasn't been removed yet)
 
> 
> It would be great if you as the author could give us a few hints.
> 
> Thank you very much for your reply!
> 
> Sincerely,
> 
> Yu Zhang @ Compute Platform IONOS
> 03.04.2013



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-04  6:45     ` Jinpu Wang
@ 2023-04-04 12:25       ` Igor Mammedov
  2023-04-04 16:00         ` Yu Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Igor Mammedov @ 2023-04-04 12:25 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: Yu Zhang, Laurent Vivier, qemu-devel, Elmar Gerdes

On Tue, 4 Apr 2023 08:45:54 +0200
Jinpu Wang <jinpu.wang@ionos.com> wrote:

> Hi Yu,
> 
> On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang <yu.zhang@ionos.com> wrote:
> >
> > Dear Laurent,
> >
> > Thank you for your quick reply. We used qemu-7.1, but it is reproducible with qemu from v6.2 to the recent v8.0 release candidates.
> > I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2 and v6.2.0-rc3).
> >
> > If it doesn't break anything else, it suffices to remove the line below from acpi_pcihp_device_unplug_request_cb():
> >
> >     pdev->qdev.pending_deleted_event = true;
> >
> > but you may have a reason to keep it. First of all, I'll open a bug in the bug tracker and let you know.
> >
> > Best regards,
> > Yu Zhang  
> This patch from Igor Mammedov seems relevant,
> https://lore.kernel.org/qemu-devel/20230403131833-mutt-send-email-mst@kernel.org/T/#t

this patch targets corner case of early boot where
guest hasn't initialized ACPI subsystem yet and 'broken'
management asking to unplug device too early which leads
to device stuck in being unplugged state due to regression
in QEMU.
However, It doesn't apply to fully booted guest.

[...]

> >> > The purpose is for detecting the end of the PCI device hot-unplug. However, we feel the
> >> > error confusing. How is it possible that a disk "is already in the process of unplug"
> >> > during the first hot-unplug attempt? So far as I know, the issue was also encountered by
> >> > libvirt, but they simply ignored it:
> >> >
> >> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
> >> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
see my other reply email/BZ comment 17.

[...]



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-04 12:25       ` Igor Mammedov
@ 2023-04-04 16:00         ` Yu Zhang
  2023-04-05  7:51           ` Igor Mammedov
  0 siblings, 1 reply; 9+ messages in thread
From: Yu Zhang @ 2023-04-04 16:00 UTC (permalink / raw)
  To: Igor Mammedov, Laurent Vivier, qemu-devel, Jinpu Wang,
	Elmar Gerdes

[-- Attachment #1: Type: text/plain, Size: 3648 bytes --]

> this patch targets corner case of early boot where
> guest hasn't initialized ACPI subsystem yet and 'broken'
> management asking to unplug device too early which leads
> to device stuck in being unplugged state due to regression
> in QEMU.
> However, It doesn't apply to fully booted guest.

by adding a few debug lines I see that in
acpi_pcihp_device_unplug_request_cb(),

    pdev->qdev.pending_deleted_event = true;

was executed, which then directly triggered the error in:

void qmp_device_del(const char *id, Error **errp)
{
    DeviceState *dev = find_device_state(id, errp);
    if (dev != NULL) {
        if (dev->pending_deleted_event &&
            (dev->pending_deleted_expires_ms == 0 ||
             dev->pending_deleted_expires_ms >
qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL))) {
            error_setg(errp, "Device %s is already in the "
                             "process of unplug", id);
            return;
        }

        qdev_unplug(dev, errp);
    }
}

In QEMU code, there are 6 lines where this flag is changed:

hw/core/qdev.c:564:        dev->pending_deleted_event = false;
hw/core/qdev.c:601:        dev->pending_deleted_event = true;
hw/acpi/pcihp.c:219:                    qdev->pending_deleted_event = false;
hw/acpi/pcihp.c:359:    pdev->qdev.pending_deleted_event = true;
hw/pci/pcie.c:516:        dev->qdev.pending_deleted_event = false;
hw/pci/pcie.c:573:    dev->pending_deleted_event = true;

Considering the complexity of the code, the logic for setting and clearing
this flag
seems not quite straightforward. I doubt that the setting of
pending_deleted_event in
acpi_pcihp_device_unplug_request_cb() is the appropriate approach to
accomplish its target.


On Tue, Apr 4, 2023 at 2:25 PM Igor Mammedov <imammedo@redhat.com> wrote:

> On Tue, 4 Apr 2023 08:45:54 +0200
> Jinpu Wang <jinpu.wang@ionos.com> wrote:
>
> > Hi Yu,
> >
> > On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang <yu.zhang@ionos.com> wrote:
> > >
> > > Dear Laurent,
> > >
> > > Thank you for your quick reply. We used qemu-7.1, but it is
> reproducible with qemu from v6.2 to the recent v8.0 release candidates.
> > > I found that it's introduced by the commit  9323f892b39 (between
> v6.2.0-rc2 and v6.2.0-rc3).
> > >
> > > If it doesn't break anything else, it suffices to remove the line
> below from acpi_pcihp_device_unplug_request_cb():
> > >
> > >     pdev->qdev.pending_deleted_event = true;
> > >
> > > but you may have a reason to keep it. First of all, I'll open a bug in
> the bug tracker and let you know.
> > >
> > > Best regards,
> > > Yu Zhang
> > This patch from Igor Mammedov seems relevant,
> >
> https://lore.kernel.org/qemu-devel/20230403131833-mutt-send-email-mst@kernel.org/T/#t
>
> this patch targets corner case of early boot where
> guest hasn't initialized ACPI subsystem yet and 'broken'
> management asking to unplug device too early which leads
> to device stuck in being unplugged state due to regression
> in QEMU.
> However, It doesn't apply to fully booted guest.
>
> [...]
>
> > >> > The purpose is for detecting the end of the PCI device hot-unplug.
> However, we feel the
> > >> > error confusing. How is it possible that a disk "is already in the
> process of unplug"
> > >> > during the first hot-unplug attempt? So far as I know, the issue
> was also encountered by
> > >> > libvirt, but they simply ignored it:
> > >> >
> > >> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
> > >> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>
> see my other reply email/BZ comment 17.
>
> [...]
>
>

[-- Attachment #2: Type: text/html, Size: 4997 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: an issue for device hot-unplug
  2023-04-04 16:00         ` Yu Zhang
@ 2023-04-05  7:51           ` Igor Mammedov
  0 siblings, 0 replies; 9+ messages in thread
From: Igor Mammedov @ 2023-04-05  7:51 UTC (permalink / raw)
  To: Yu Zhang; +Cc: Laurent Vivier, qemu-devel, Jinpu Wang, Elmar Gerdes

On Tue, 4 Apr 2023 18:00:06 +0200
Yu Zhang <yu.zhang@ionos.com> wrote:

> > this patch targets corner case of early boot where
> > guest hasn't initialized ACPI subsystem yet and 'broken'
> > management asking to unplug device too early which leads
> > to device stuck in being unplugged state due to regression
> > in QEMU.
> > However, It doesn't apply to fully booted guest.  
> 
> by adding a few debug lines I see that in
> acpi_pcihp_device_unplug_request_cb(),
> 
>     pdev->qdev.pending_deleted_event = true;
> 
> was executed, which then directly triggered the error in:

If you do repeat unplug request right away after the 1st one,
then getting this error is expected behavior
(as guest needs time to react and unplug device).

> void qmp_device_del(const char *id, Error **errp)
> {
>     DeviceState *dev = find_device_state(id, errp);
>     if (dev != NULL) {
>         if (dev->pending_deleted_event &&
>             (dev->pending_deleted_expires_ms == 0 ||
>              dev->pending_deleted_expires_ms >
> qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL))) {
>             error_setg(errp, "Device %s is already in the "
>                              "process of unplug", id);
>             return;
>         }
> 
>         qdev_unplug(dev, errp);
>     }
> }
> 
> In QEMU code, there are 6 lines where this flag is changed:
> 
> hw/core/qdev.c:564:        dev->pending_deleted_event = false;
> hw/core/qdev.c:601:        dev->pending_deleted_event = true;
> hw/acpi/pcihp.c:219:                    qdev->pending_deleted_event = false;
> hw/acpi/pcihp.c:359:    pdev->qdev.pending_deleted_event = true;
> hw/pci/pcie.c:516:        dev->qdev.pending_deleted_event = false;
> hw/pci/pcie.c:573:    dev->pending_deleted_event = true;
> 
> Considering the complexity of the code, the logic for setting and clearing
> this flag
> seems not quite straightforward. I doubt that the setting of
> pending_deleted_event in
> acpi_pcihp_device_unplug_request_cb() is the appropriate approach to
> accomplish its target.

It's true that pending_deleted_event is abused by failover and later by
pci hotplug.
see comment 15 where Paolo suggest how to fix it
   https://bugzilla.redhat.com/show_bug.cgi?id=1878659#c15

and than see comment 17, explaining that cleaning up pending_deleted_event
usage won't change current behavior.

Anyways, clean up patches are welcome if you wish to follow up
on Paolo's suggestion.

> On Tue, Apr 4, 2023 at 2:25 PM Igor Mammedov <imammedo@redhat.com> wrote:
> 
> > On Tue, 4 Apr 2023 08:45:54 +0200
> > Jinpu Wang <jinpu.wang@ionos.com> wrote:
> >  
> > > Hi Yu,
> > >
> > > On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang <yu.zhang@ionos.com> wrote:  
> > > >
> > > > Dear Laurent,
> > > >
> > > > Thank you for your quick reply. We used qemu-7.1, but it is  
> > reproducible with qemu from v6.2 to the recent v8.0 release candidates.  
> > > > I found that it's introduced by the commit  9323f892b39 (between  
> > v6.2.0-rc2 and v6.2.0-rc3).  
> > > >
> > > > If it doesn't break anything else, it suffices to remove the line  
> > below from acpi_pcihp_device_unplug_request_cb():  
> > > >
> > > >     pdev->qdev.pending_deleted_event = true;
> > > >
> > > > but you may have a reason to keep it. First of all, I'll open a bug in  
> > the bug tracker and let you know.  
> > > >
> > > > Best regards,
> > > > Yu Zhang  
> > > This patch from Igor Mammedov seems relevant,
> > >  
> > https://lore.kernel.org/qemu-devel/20230403131833-mutt-send-email-mst@kernel.org/T/#t
> >
> > this patch targets corner case of early boot where
> > guest hasn't initialized ACPI subsystem yet and 'broken'
> > management asking to unplug device too early which leads
> > to device stuck in being unplugged state due to regression
> > in QEMU.
> > However, It doesn't apply to fully booted guest.
> >
> > [...]
> >  
> > > >> > The purpose is for detecting the end of the PCI device hot-unplug.  
> > However, we feel the  
> > > >> > error confusing. How is it possible that a disk "is already in the  
> > process of unplug"  
> > > >> > during the first hot-unplug attempt? So far as I know, the issue  
> > was also encountered by  
> > > >> > libvirt, but they simply ignored it:
> > > >> >
> > > >> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
> > > >> > <https://bugzilla.redhat.com/show_bug.cgi?id=1878659>  
> > see my other reply email/BZ comment 17.
> >
> > [...]
> >
> >  



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-04-05  7:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-03 13:24 an issue for device hot-unplug Yu Zhang
2023-04-03 16:32 ` Laurent Vivier
2023-04-03 16:59   ` Yu Zhang
2023-04-04  6:45     ` Jinpu Wang
2023-04-04 12:25       ` Igor Mammedov
2023-04-04 16:00         ` Yu Zhang
2023-04-05  7:51           ` Igor Mammedov
2023-04-04 10:00     ` Jinpu Wang
2023-04-04 12:17 ` Igor Mammedov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).