* [PATCH 0/2] Two fixes for libxl's PCI detach operation @ 2014-11-10 23:16 Boris Ostrovsky 2014-11-10 23:16 ` [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down Boris Ostrovsky 2014-11-10 23:16 ` [PATCH 2/2] libxl: Simplify cleanup in do_pci_remove() Boris Ostrovsky 0 siblings, 2 replies; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-10 23:16 UTC (permalink / raw) To: ian.jackson, stefano.stabellini, ian.campbell, wei.liu2 Cc: boris.ostrovsky, xen-devel Two patches to fix 'xl pci-detach'/do_pci_remove() behavior. * Prevent libxl from resetting device (and removing it from xenstore) before QEMU responded to the guest's 'eject' write * Remove unnecessary calls to xc_physdev_unmap_pirq() and xc_domain_irq_permission() If patches are acceptable I suggest including the first one in 4.5 as it fixes a bug. The second patch is about removing harmless but annoying error messages so it can wait. This has been tested by Konrad for HVM guests. His tests failed with PV guests. However, they failed in the same manner without these patches so it's unlikely that these two introduced a regression. My PV tests worked fine (except for a warning in the guest, but again, I had the same warning before changes). Boris Ostrovsky (2): libxl: Wait until QEMU removed the device before tearing it down libxl: Simplify cleanup in do_pci_remove() tools/libxl/libxl_internal.h | 1 + tools/libxl/libxl_pci.c | 54 +++++++++++++++++++++++++---------- tools/libxl/libxl_qmp.c | 64 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 104 insertions(+), 15 deletions(-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-10 23:16 [PATCH 0/2] Two fixes for libxl's PCI detach operation Boris Ostrovsky @ 2014-11-10 23:16 ` Boris Ostrovsky 2014-11-14 14:41 ` Ian Jackson 2014-11-14 16:19 ` Ian Jackson 2014-11-10 23:16 ` [PATCH 2/2] libxl: Simplify cleanup in do_pci_remove() Boris Ostrovsky 1 sibling, 2 replies; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-10 23:16 UTC (permalink / raw) To: ian.jackson, stefano.stabellini, ian.campbell, wei.liu2 Cc: boris.ostrovsky, xen-devel When a device is hot-unplugged libxl sends QEMU a "device-del" message (via QMP). This call returns after QEMU has initiated device removal by sending an interrupt to the guest. At some point later QEMU is expected to clean up after the device (such as unbind/unmap MSIs), which will occur when the guest signals that the device has been ejected. Before this happens, however, it is likely that libxl will try to unmap the pirq itself (without first unbinding it) and then reset the device. The unmapping will result in the hypervisor having to perform a forced unbind and that, in turn, will cause any future binds (such as when we try to hotplug the device back) to fail. In addition, if the guest kernel had not had a chance to respond to QEMU's interrupt before libxl resets the device it may fail to properly clean up, e.g. with Linux igb driver: [ 26.976212] WARNING: CPU: 0 PID: 6 at lib/iomap.c:43 bad_io_access+0x3d/0x40() [ 26.985855] Bad IO access at port 0x0 () [ 26.990303] Modules linked in: xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack scsi_transport_iscsi igb i2c_algo_bit ptp pps_core i2c_core hwmon dca floppy [ 27.016289] CPU: 0 PID: 6 Comm: kworker/u256:0 Not tainted 3.18.0-rc2 #66 [ 27.025154] Hardware name: Xen HVM domU, BIOS 4.5-unstable 10/24/2014 [ 27.033357] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 27.040638] 0000000000000009 ffff88003c2f3b38 ffffffff8169dbd1 0000000000000001 [ 27.050884] ffff88003c2f3b88 ffff88003c2f3b78 ffffffff8109149c ffff88003c2f3b88 [ 27.060959] ffff880039eb6000 ffff88003b8a8000 ffff880039eb6880 ffff88003b8a8098 [ 27.071173] Call Trace: [ 27.074525] [<ffffffff8169dbd1>] dump_stack+0x46/0x58 [ 27.081411] [<ffffffff8109149c>] warn_slowpath_common+0x8c/0xc0 [ 27.089290] [<ffffffff81091586>] warn_slowpath_fmt+0x46/0x50 [ 27.096756] [<ffffffffa00461c4>] ? igb_reset_interrupt_capability+0x64/0x80 [igb] [ 27.106651] [<ffffffff8137bded>] bad_io_access+0x3d/0x40 [ 27.113785] [<ffffffff8137be1c>] pci_iounmap+0x2c/0x40 [ 27.120619] [<ffffffffa00489cf>] igb_remove+0xbf/0x160 [igb] [ 27.128161] [<ffffffff813a0276>] pci_device_remove+0x46/0xc0 [ 27.135895] [<ffffffff81479f8f>] __device_release_driver+0x7f/0xf0 [ 27.144748] [<ffffffff8147a34c>] device_release_driver+0x2c/0x40 [ 27.153123] [<ffffffff81399b4c>] pci_stop_bus_device+0x9c/0xb0 [ 27.160951] [<ffffffff81399cc6>] pci_stop_and_remove_bus_device+0x16/0x30 [ 27.170290] [<ffffffff813b64a7>] disable_slot+0x57/0xb0 [ 27.177610] [<ffffffff813b6521>] acpiphp_disable_and_eject_slot+0x21/0x90 [ 27.187063] [<ffffffff813b7031>] acpiphp_hotplug_notify+0x141/0x220 [ 27.195588] [<ffffffff813b6ef0>] ? acpiphp_post_dock_fixup+0xd0/0xd0 [ 27.204205] [<ffffffff813deb6d>] acpi_device_hotplug+0x3a1/0x3f0 [ 27.212677] [<ffffffff813d8871>] acpi_hotplug_work_fn+0x1f/0x2b [ 27.220853] [<ffffffff810a8b8c>] process_one_work+0x15c/0x410 [ 27.228802] [<ffffffff810a8f9d>] worker_thread+0x11d/0x520 [ 27.238591] [<ffffffff810a8e80>] ? process_scheduled_works+0x40/0x40 [ 27.251911] [<ffffffff810aed49>] kthread+0xc9/0xe0 [ 27.262033] [<ffffffff810aec80>] ? flush_kthread_worker+0x70/0x70 [ 27.274774] [<ffffffff816a62bc>] ret_from_fork+0x7c/0xb0 [ 27.285985] [<ffffffff810aec80>] ? flush_kthread_worker+0x70/0x70 [ 27.298839] ---[ end trace e8838cc146d5f64c ]--- To avoid this we should wait until QEMU has completed device teardown (which, at least for Linux, happens after the guest is done with its own cleanup). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- tools/libxl/libxl_internal.h | 1 + tools/libxl/libxl_pci.c | 22 ++++++++++++++ tools/libxl/libxl_qmp.c | 64 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 87 insertions(+), 0 deletions(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index f61673c..139da3a 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -1589,6 +1589,7 @@ _hidden libxl__qmp_handler *libxl__qmp_initialize(libxl__gc *gc, uint32_t domid); /* ask to QEMU the serial port information and store it in xenstore. */ _hidden int libxl__qmp_query_serial(libxl__qmp_handler *qmp); +_hidden int libxl__qmp_pci_test(libxl__gc *gc, int d, libxl_device_pci *pcidev); _hidden int libxl__qmp_pci_add(libxl__gc *gc, int d, libxl_device_pci *pcidev); _hidden int libxl__qmp_pci_del(libxl__gc *gc, int domid, libxl_device_pci *pcidev); diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c index 9f40100..4c6a733 100644 --- a/tools/libxl/libxl_pci.c +++ b/tools/libxl/libxl_pci.c @@ -1246,6 +1246,28 @@ static int do_pci_remove(libxl__gc *gc, uint32_t domid, break; case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN: rc = libxl__qmp_pci_del(gc, domid, pcidev); + if (rc < 0) + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "libxl__qmp_pci_del"); + else { + unsigned total_us = 0, wait_us = 100000; + + rc = -ETIMEDOUT; + /* Wait (at most ~6 seconds) for device to disappear */ + do { + int err = libxl__qmp_pci_test(gc, domid, pcidev); + if (err) { + if (err == -ENODEV) + rc = 0; + else + rc = err; + break; + } + usleep(wait_us); + total_us += wait_us; + wait_us *= 2; + } while (total_us < 5000000); + } + break; default: rc = ERROR_INVAL; diff --git a/tools/libxl/libxl_qmp.c b/tools/libxl/libxl_qmp.c index c7324e6..3235c83 100644 --- a/tools/libxl/libxl_qmp.c +++ b/tools/libxl/libxl_qmp.c @@ -739,6 +739,41 @@ static int qmp_query_vnc(libxl__qmp_handler *qmp) NULL, qmp->timeout); } +static int pci_exists_callback(libxl__qmp_handler *qmp, + const libxl__json_object *response, void *opaque) +{ + libxl_device_pci *pcidev = opaque; + const libxl__json_object *bus; + GC_INIT(qmp->ctx); + int i, j, rc = -ENODEV; + char *asked_id = GCSPRINTF(PCI_PT_QDEV_ID, + pcidev->bus, pcidev->dev, pcidev->func); + + for (i = 0; (bus = libxl__json_array_get(response, i)); i++) { + const libxl__json_object *devices = NULL; + const libxl__json_object *device = NULL; + const libxl__json_object *o = NULL; + const char *id = NULL; + + devices = libxl__json_map_get("devices", bus, JSON_ARRAY); + + for (j = 0; (device = libxl__json_array_get(devices, j)); j++) { + o = libxl__json_map_get("qdev_id", device, JSON_STRING); + id = libxl__json_object_get_string(o); + + if (id && strcmp(asked_id, id) == 0) { + rc = 0; + goto out; + } + } + } + +out: + GC_FREE; + return rc; + +} + static int pci_add_callback(libxl__qmp_handler *qmp, const libxl__json_object *response, void *opaque) { @@ -804,6 +839,35 @@ static int qmp_run_command(libxl__gc *gc, int domid, return rc; } +int libxl__qmp_pci_test(libxl__gc *gc, int domid, libxl_device_pci *pcidev) +{ + libxl__qmp_handler *qmp = NULL; + libxl__json_object *args = NULL; + char *hostaddr = NULL; + int rc = 0; + + qmp = libxl__qmp_initialize(gc, domid); + if (!qmp) + return -1; + + hostaddr = GCSPRINTF("%04x:%02x:%02x.%01x", pcidev->domain, + pcidev->bus, pcidev->dev, pcidev->func); + if (!hostaddr) + return -1; + + qmp_parameters_add_string(gc, &args, "driver", "xen-pci-passthrough"); + QMP_PARAMETERS_SPRINTF(&args, "id", PCI_PT_QDEV_ID, + pcidev->bus, pcidev->dev, pcidev->func); + qmp_parameters_add_string(gc, &args, "hostaddr", hostaddr); + + rc = qmp_synchronous_send(qmp, "query-pci", NULL, + pci_exists_callback, pcidev, qmp->timeout); + + libxl__qmp_close(qmp); + return rc; + +} + int libxl__qmp_pci_add(libxl__gc *gc, int domid, libxl_device_pci *pcidev) { libxl__qmp_handler *qmp = NULL; -- 1.7.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-10 23:16 ` [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down Boris Ostrovsky @ 2014-11-14 14:41 ` Ian Jackson 2014-11-14 15:30 ` Boris Ostrovsky 2014-11-14 16:19 ` Ian Jackson 1 sibling, 1 reply; 21+ messages in thread From: Ian Jackson @ 2014-11-14 14:41 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("[PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > When a device is hot-unplugged libxl sends QEMU a "device-del" message > (via QMP). This call returns after QEMU has initiated device removal by > sending an interrupt to the guest. At some point later QEMU is expected > to clean up after the device (such as unbind/unmap MSIs), which will > occur when the guest signals that the device has been ejected. I haven't repro'd this bug, but after looking at your patch, and staring at the code, I think you may have misdiagnosed the problem. I found this: if (type == LIBXL_DOMAIN_TYPE_HVM) { [stuff] } else if (type != LIBXL_DOMAIN_TYPE_PV) abort(); { This is, of course, very bizarre. For a moment I put it down to coding style, but the subsequent block is the PV pci unplug path. It would appear therefore that HVM PCI unplug executes first the HVM code, and then immediately afterwards the PV code. This strangeness was introduced in abfb006f "tools/libxl: explicitly grant access to needed I/O-memory ranges" which seems to contain an entirely wrong hunk. Can you try the patch below and see if it helps your problem ? AFAICT your other patch probably isn't needed in the light of all this. If it is, then I have misunderstood (and perhaps the commit message could be more explicit about the bug that this is allegedly fixing - calling a patch `Simplify ....' makes it sound like the kind of cleanup we don't want to take during the freeze). Thanks Ian. Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c index 9f40100..bec25a9 100644 --- a/tools/libxl/libxl_pci.c +++ b/tools/libxl/libxl_pci.c @@ -1255,9 +1255,9 @@ static int do_pci_remove(libxl__gc *gc, uint32_t domid, rc = ERROR_FAIL; goto out_fail; } - } else if (type != LIBXL_DOMAIN_TYPE_PV) - abort(); - { + } else { + assert(type == LIBXL_DOMAIN_TYPE_PV); + char *sysfs_path = libxl__sprintf(gc, SYSFS_PCI_DEV"/"PCI_BDF"/resource", pcidev->domain, pcidev->bus, pcidev->dev, pcidev->func); FILE *f = fopen(sysfs_path, "r"); ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 14:41 ` Ian Jackson @ 2014-11-14 15:30 ` Boris Ostrovsky 2014-11-14 16:01 ` Ian Jackson 0 siblings, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 15:30 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini On 11/14/2014 09:41 AM, Ian Jackson wrote: > Boris Ostrovsky writes ("[PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >> When a device is hot-unplugged libxl sends QEMU a "device-del" message >> (via QMP). This call returns after QEMU has initiated device removal by >> sending an interrupt to the guest. At some point later QEMU is expected >> to clean up after the device (such as unbind/unmap MSIs), which will >> occur when the guest signals that the device has been ejected. > I haven't repro'd this bug, but after looking at your patch, and > staring at the code, I think you may have misdiagnosed the problem. > > I found this: > > if (type == LIBXL_DOMAIN_TYPE_HVM) { > [stuff] > } else if (type != LIBXL_DOMAIN_TYPE_PV) > abort(); > { > > This is, of course, very bizarre. For a moment I put it down to > coding style, but the subsequent block is the PV pci unplug path. It > would appear therefore that HVM PCI unplug executes first the HVM > code, and then immediately afterwards the PV code. > > This strangeness was introduced in abfb006f > "tools/libxl: explicitly grant access to needed I/O-memory ranges" > which seems to contain an entirely wrong hunk. > > Can you try the patch below and see if it helps your problem ? It does help with the first half of the problem --- we don't do the unnecessary unmap anymore in what now becomes PV-only 'else if (type != LIBXL_DOMAIN_TYPE_PV)' clause. But it doesn't help with the second part (the one that shows Linux WARNING), because we still may try to reset the device before guest kernel is done unloading the driver. Reproducing that problem is probably dependent on the guest/driver: in this particular case the igb driver does an IN instruction while unloading the driver and if that instruction fails (which may happen if the device is gone from the POV of toolstack) then it triggers the warning. > > AFAICT your other patch probably isn't needed in the light of all > this. If it is, then I have misunderstood (and perhaps the commit > message could be more explicit about the bug that this is allegedly > fixing - calling a patch `Simplify ....' makes it sound like the kind > of cleanup we don't want to take during the freeze). And I believe we still need part of the second patch --- the one that removes call to xc_domain_irq_permission() for PV guests (after your patch is applied): this call will fail because xc_physdev_unmap_pirq() above it will cause hypervisor to do unmap_domain_pirq()->irq_deny_access() So I think we need: 1. Your patch 2. My first patch 3. Part of my second patch as described above. (and make commit message/subject more understandable) Thanks. -boris > > Thanks > Ian. > > Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> > > diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c > index 9f40100..bec25a9 100644 > --- a/tools/libxl/libxl_pci.c > +++ b/tools/libxl/libxl_pci.c > @@ -1255,9 +1255,9 @@ static int do_pci_remove(libxl__gc *gc, uint32_t domid, > rc = ERROR_FAIL; > goto out_fail; > } > - } else if (type != LIBXL_DOMAIN_TYPE_PV) > - abort(); > - { > + } else { > + assert(type == LIBXL_DOMAIN_TYPE_PV); > + > char *sysfs_path = libxl__sprintf(gc, SYSFS_PCI_DEV"/"PCI_BDF"/resource", pcidev->domain, > pcidev->bus, pcidev->dev, pcidev->func); > FILE *f = fopen(sysfs_path, "r"); ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 15:30 ` Boris Ostrovsky @ 2014-11-14 16:01 ` Ian Jackson 2014-11-14 16:24 ` Boris Ostrovsky 0 siblings, 1 reply; 21+ messages in thread From: Ian Jackson @ 2014-11-14 16:01 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > But it doesn't help with the second part (the one that shows Linux > WARNING), because we still may try to reset the device before guest > kernel is done unloading the driver. Reproducing that problem is > probably dependent on the guest/driver: in this particular case the igb > driver does an IN instruction while unloading the driver and if that > instruction fails (which may happen if the device is gone from the POV > of toolstack) then it triggers the warning. Right. OK. Looking at the code, I think you have convinced about that. I will review the implementation in your patch now. > And I believe we still need part of the second patch --- the one that > removes call to xc_domain_irq_permission() for PV guests (after your > patch is applied): this call will fail because xc_physdev_unmap_pirq() > above it will cause hypervisor to do unmap_domain_pirq()->irq_deny_access() What call to xc_physdev_unmap_pirq ? I see one in the PV path. Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:01 ` Ian Jackson @ 2014-11-14 16:24 ` Boris Ostrovsky 2014-11-14 16:31 ` Ian Jackson 0 siblings, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 16:24 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini On 11/14/2014 11:01 AM, Ian Jackson wrote: > Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > > >> And I believe we still need part of the second patch --- the one that >> removes call to xc_domain_irq_permission() for PV guests (after your >> patch is applied): this call will fail because xc_physdev_unmap_pirq() >> above it will cause hypervisor to do unmap_domain_pirq()->irq_deny_access() > What call to xc_physdev_unmap_pirq ? I see one in the PV path. (Now with Reply-all, sorry) At the skip1 label. The error is: [root@ovs105 libxl]# ./xl pci-detach pvlinux 0000:09:00.1 libxl: error: libxl_pci.c:1305:do_pci_remove: xc_domain_irq_permission irq=17: Invalid argument [root@ovs105 libxl]# Thanks. -boris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:24 ` Boris Ostrovsky @ 2014-11-14 16:31 ` Ian Jackson 2014-11-14 16:37 ` Boris Ostrovsky 2014-11-14 17:45 ` Sander Eikelenboom 0 siblings, 2 replies; 21+ messages in thread From: Ian Jackson @ 2014-11-14 16:31 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > (Now with Reply-all, sorry) Likewise. > On 11/14/2014 11:01 AM, Ian Jackson wrote: > > What call to xc_physdev_unmap_pirq ? I see one in the PV path. > > At the skip1 label. `goto skip1' only appears in the PV path AFAICT. Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:31 ` Ian Jackson @ 2014-11-14 16:37 ` Boris Ostrovsky 2014-11-14 16:36 ` Ian Jackson 2014-11-14 17:45 ` Sander Eikelenboom 1 sibling, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 16:37 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini On 11/14/2014 11:31 AM, Ian Jackson wrote: > Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > >> On 11/14/2014 11:01 AM, Ian Jackson wrote: >>> What call to xc_physdev_unmap_pirq ? I see one in the PV path. >> At the skip1 label. > `goto skip1' only appears in the PV path AFAICT. > Right, this is all about PV code path. -boris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:37 ` Boris Ostrovsky @ 2014-11-14 16:36 ` Ian Jackson 2014-11-14 16:58 ` Boris Ostrovsky 0 siblings, 1 reply; 21+ messages in thread From: Ian Jackson @ 2014-11-14 16:36 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > On 11/14/2014 11:31 AM, Ian Jackson wrote: > > `goto skip1' only appears in the PV path AFAICT. > > Right, this is all about PV code path. So now I am confused. Why is qemu involved ? Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:36 ` Ian Jackson @ 2014-11-14 16:58 ` Boris Ostrovsky 0 siblings, 0 replies; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 16:58 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini On 11/14/2014 11:36 AM, Ian Jackson wrote: > Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >> On 11/14/2014 11:31 AM, Ian Jackson wrote: >>> `goto skip1' only appears in the PV path AFAICT. >> Right, this is all about PV code path. > So now I am confused. Why is qemu involved ? This has nothing to do with QEMU. It's just that we are having this conversation in the thread with wrong subject ;-) We should be talking about it in "[PATCH 2/2] libxl: Simplify cleanup in do_pci_remove()" -boris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:31 ` Ian Jackson 2014-11-14 16:37 ` Boris Ostrovsky @ 2014-11-14 17:45 ` Sander Eikelenboom 2014-11-14 18:07 ` Ian Jackson 1 sibling, 1 reply; 21+ messages in thread From: Sander Eikelenboom @ 2014-11-14 17:45 UTC (permalink / raw) To: Ian Jackson Cc: Boris Ostrovsky, stefano.stabellini, wei.liu2, ian.campbell, xen-devel Friday, November 14, 2014, 5:31:03 PM, you wrote: > Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >> (Now with Reply-all, sorry) > Likewise. >> On 11/14/2014 11:01 AM, Ian Jackson wrote: >> > What call to xc_physdev_unmap_pirq ? I see one in the PV path. >> >> At the skip1 label. > `goto skip1' only appears in the PV path AFAICT. Not wanting to hijack this thread, but since it addresses a problem i tried to report and partly fix earlier, just 3 notes: 1) xc_physdev_unmap_pirq does get called when destroying a HVM guest. See this part of an earlier report of mine: http://lists.xen.org/archives/html/xen-devel/2014-10/msg02518.html : - When i destroy the guest (instead of shutting it down), "xen_pcibk_disable_msi()" does get called, but then i get an error from the toolstack, it reads the /sys/bus/pci/devices/<BDF>/irq value which is still "68" and that is not assigned to the guest: libxl: error: libxl_pci.c:1319:do_pci_remove: xc_physdev_unmap_pirq irq=68: Invalid argument libxl: error: libxl_pci.c:1323:do_pci_remove: xc_domain_irq_permission irq=68: Invalid argument On the guest start it called xc_physdev_unmap_pirq and xc_domain_irq_permission for irq=22 .. This particular error only occurs when the guest is has enabled MSI, this changes the irq number in /sys/bus/pci/devices/<BDF>/irq to the MSI nr. So on guest destroy it now tries to do things based on the MSI nr instead of the intX irq. 2) When setting and updating the device states in xenstore for a passed through device, libxl doesn't mimic the states pciback and pcifront set and use, very well. (see f.e. http://lists.xen.org/archives/html/xen-devel/2014-08/msg00970.html) And when 3) When a HVM guest has multiple pci devices passes through, pciback doesn't currently get a signal *what* pci-device has been removed from the guest on a 'xl pci-detach', since it only has a watch on the whole pci-root in xenstore, since it doesn't know which device has been yanked out, it also can't clean up everything properly until *all* devices have been removed from the guest and the whole pci-root entry gets removed from xenstore. The effect is that if you remove only one device from a guest with "xl pci-detach" and then do a "xl pci-assignable-remove" you will run into trouble. Just my 3 notes and 2 cents on issues related to this code. > Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 17:45 ` Sander Eikelenboom @ 2014-11-14 18:07 ` Ian Jackson 2014-11-14 19:24 ` Sander Eikelenboom 0 siblings, 1 reply; 21+ messages in thread From: Ian Jackson @ 2014-11-14 18:07 UTC (permalink / raw) To: Sander Eikelenboom Cc: Boris Ostrovsky, stefano.stabellini, wei.liu2, ian.campbell, xen-devel Sander Eikelenboom writes ("Re: [Xen-devel] [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > 1) xc_physdev_unmap_pirq does get called when destroying a HVM guest. Yes, but I think that is so only because of the block structure bug introduced in abfb006f. > 2) When setting and updating the device states in xenstore for > a[...] The rest of your message is difficult to read on my screen due to wrap damage. Can you wrap it to <75 characters ? Thanks, Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 18:07 ` Ian Jackson @ 2014-11-14 19:24 ` Sander Eikelenboom 2014-11-14 21:09 ` Boris Ostrovsky 0 siblings, 1 reply; 21+ messages in thread From: Sander Eikelenboom @ 2014-11-14 19:24 UTC (permalink / raw) To: Ian Jackson Cc: Boris Ostrovsky, stefano.stabellini, wei.liu2, ian.campbell, xen-devel Friday, November 14, 2014, 7:07:46 PM, you wrote: > Sander Eikelenboom writes ("Re: [Xen-devel] [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >> 1) xc_physdev_unmap_pirq does get called when destroying a HVM guest. > Yes, but I think that is so only because of the block structure bug > introduced in abfb006f. OK, if this code is only for PV guests, this problem could also still be there for PV guests then (they can also enable MSI AFAIK). What i saw was that when a guest enables MSI for a device, the value of /sys/bus/pci/devices/<BDF>/irq in dom0 changes from the intx to the msi irq value. This occurs after libxl has read the irq and has called: When shutting down the guest disables MSI for that device and the value of /sys/bus/pci/devices/<BDF>/irq in dom0 is changed back to the intx irq before libxl reads it again and calls: So there is no error. However in the destroy case the guest doesn't disable MSI, so the value of /sys/bus/pci/devices/<BDF>/irq in dom0 isn't changed, so libxl tries to operate on the wrong value. I don't know if it is a kernel bug that /sys/bus/pci/devices/<BDF>/irq changes when MSI are enabled, or that libxl shouldn't depend on that value. >> 2) When setting and updating the device states in xenstore for >> a[...] > The rest of your message is difficult to read on my screen due to wrap > damage. Can you wrap it to <75 characters ? Hmm it could do with some restructuring an rephrasing as well, sorry for that, hope it's a bit clearer now (combined note 2 and 3): 2) When setting and updating the device states in xenstore for a passed through device, libxl doesn't mimic the states that pciback and pcifront set and use in their "dance", very well. (see f.e. http://lists.xen.org/archives/html/xen-devel/2014-08/msg00970.html ) When a HVM guest has multiple pci devices passed through, pciback doesn't currently get a signal *what* pci-device has been removed from the guest on a 'xl pci-detach'. Pciback currently only has a watch on the whole "pci-root" of a guest in xenstore, so it only knows *something* has changed: - in or under the "pci-root" entry - In the PV guest case, the xenbus-state of a individual pci device node is set to a different state by pcifront, so pciback can take action on cleaning up / resetting this particular physical device. - In the HVM guest case however, the xenbus-states are don't mimic what pcifront does, so pciback doesn't do all the clean up and resetting on the physical device when doing a "xl pci-detach". It *does* cleanup when *all* devices and there for the complete" "pci-root" entry for a device is yanked out of xenstore. That's why you don't get intro trouble when you shutdown a guest with multiple devices passed through. You also don't get into trouble when a single device is passed through and removed (the "pci-root" entry in xenstore is removed when the last pci device of a guest is removed from the guest. But you *do* get into trouble when hot-unplugging any but the last device from a guest which has multiple pci devices passed through to it. You also won't notice it while it is still owned by pciback. You do notice it when you try to do a "xl pci-assignable-remove". I tried to fix this my self .. but ran into trouble because when you signal pciback via xenstore of the intent of removing a device from the guest, you need a callback but you can also run in to "timeout" issues. Another issue was that on shutdown you will remove multiple devices in short succession and i ran into locking/race issues, so it clearly became something beyond my skills at that point. > Thanks, > Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 19:24 ` Sander Eikelenboom @ 2014-11-14 21:09 ` Boris Ostrovsky 2014-11-14 21:20 ` Sander Eikelenboom 0 siblings, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 21:09 UTC (permalink / raw) To: Sander Eikelenboom, Ian Jackson Cc: stefano.stabellini, wei.liu2, ian.campbell, xen-devel On 11/14/2014 02:24 PM, Sander Eikelenboom wrote: > Friday, November 14, 2014, 7:07:46 PM, you wrote: > >> Sander Eikelenboom writes ("Re: [Xen-devel] [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >>> 1) xc_physdev_unmap_pirq does get called when destroying a HVM guest. >> Yes, but I think that is so only because of the block structure bug >> introduced in abfb006f. > OK, if this code is only for PV guests, this problem could also still be > there for PV guests then (they can also enable MSI AFAIK). > > What i saw was that when a guest enables MSI for a device, the value of > /sys/bus/pci/devices/<BDF>/irq in dom0 changes from the intx to the msi irq > value. This occurs after libxl has read the irq and has called: > > When shutting down the guest disables MSI for that device and the value of > /sys/bus/pci/devices/<BDF>/irq in dom0 is changed back to the intx irq before > libxl reads it again and calls: > > So there is no error. > However in the destroy case the guest doesn't disable MSI, so the value of > /sys/bus/pci/devices/<BDF>/irq in dom0 isn't changed, so libxl tries to operate > on the wrong value. > > I don't know if it is a kernel bug that /sys/bus/pci/devices/<BDF>/irq changes > when MSI are enabled, or that libxl shouldn't depend on that value. I am unable to reproduce irq value change. But I can see that if it changes you will see libxl complain since we'd be trying to unmap an irq that has not been mapped. What device are you passing through? > >>> 2) When setting and updating the device states in xenstore for >>> a[...] >> The rest of your message is difficult to read on my screen due to wrap >> damage. Can you wrap it to <75 characters ? > Hmm it could do with some restructuring an rephrasing as well, sorry for that, > hope it's a bit clearer now (combined note 2 and 3): > > 2) When setting and updating the device states in xenstore for > a passed through device, libxl doesn't mimic the states that > pciback and pcifront set and use in their "dance", very well. > (see f.e. http://lists.xen.org/archives/html/xen-devel/2014-08/msg00970.html ) > > When a HVM guest has multiple pci devices passed through, > pciback doesn't currently get a signal *what* pci-device has > been removed from the guest on a 'xl pci-detach'. > > Pciback currently only has a watch on the whole "pci-root" > of a guest in xenstore, so it only knows *something* has changed: > - in or under the "pci-root" entry > - In the PV guest case, the xenbus-state of a individual pci device > node is set to a different state by pcifront, so pciback can take > action on cleaning up / resetting this particular physical device. > - In the HVM guest case however, the xenbus-states are don't mimic > what pcifront does, so pciback doesn't do all the clean up and > resetting on the physical device when doing a "xl pci-detach". > It *does* cleanup when *all* devices and there for the complete" > "pci-root" entry for a device is yanked out of xenstore. That's why > you don't get intro trouble when you shutdown a guest with multiple > devices passed through. You also don't get into trouble when a > single device is passed through and removed (the "pci-root" entry in > xenstore is removed when the last pci device of a guest is removed > from the guest. > But you *do* get into trouble when hot-unplugging > any but the last device from a guest which has multiple pci devices > passed through to it. You also won't notice it while it is still > owned by pciback. You do notice it when you try to do a > "xl pci-assignable-remove". I don't know about detach but I apparently can't even properly attach a second device --- I get complaints about it already being in xenstore. But device does show up in the guest. And then I can't remove it because, well, it's not in xenstore. Maybe it has something to do with the fact that both devices are virtual functions of the same physical device. I might look at this sometime next week. -boris > > I tried to fix this my self .. but ran into trouble because when you > signal pciback via xenstore of the intent of removing a device from the > guest, you need a callback but you can also run in to "timeout" issues. > Another issue was that on shutdown you will remove multiple devices in > short succession and i ran into locking/race issues, so it clearly > became something beyond my skills at that point. > >> Thanks, >> Ian. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 21:09 ` Boris Ostrovsky @ 2014-11-14 21:20 ` Sander Eikelenboom 2014-11-14 21:38 ` Boris Ostrovsky 0 siblings, 1 reply; 21+ messages in thread From: Sander Eikelenboom @ 2014-11-14 21:20 UTC (permalink / raw) To: Boris Ostrovsky Cc: wei.liu2, stefano.stabellini, Ian Jackson, ian.campbell, xen-devel Friday, November 14, 2014, 10:09:04 PM, you wrote: > On 11/14/2014 02:24 PM, Sander Eikelenboom wrote: >> Friday, November 14, 2014, 7:07:46 PM, you wrote: >> >>> Sander Eikelenboom writes ("Re: [Xen-devel] [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >>>> 1) xc_physdev_unmap_pirq does get called when destroying a HVM guest. >>> Yes, but I think that is so only because of the block structure bug >>> introduced in abfb006f. >> OK, if this code is only for PV guests, this problem could also still be >> there for PV guests then (they can also enable MSI AFAIK). >> >> What i saw was that when a guest enables MSI for a device, the value of >> /sys/bus/pci/devices/<BDF>/irq in dom0 changes from the intx to the msi irq >> value. This occurs after libxl has read the irq and has called: >> >> When shutting down the guest disables MSI for that device and the value of >> /sys/bus/pci/devices/<BDF>/irq in dom0 is changed back to the intx irq before >> libxl reads it again and calls: >> >> So there is no error. >> However in the destroy case the guest doesn't disable MSI, so the value of >> /sys/bus/pci/devices/<BDF>/irq in dom0 isn't changed, so libxl tries to operate >> on the wrong value. >> >> I don't know if it is a kernel bug that /sys/bus/pci/devices/<BDF>/irq changes >> when MSI are enabled, or that libxl shouldn't depend on that value. > I am unable to reproduce irq value change. But I can see that if it > changes you will see libxl complain since we'd be trying to unmap an irq > that has not been mapped. > What device are you passing through? Will look at it again this weekend and report back the details of device, used irq's, kernel version etc, so you can hopefully reproduce. >> >>>> 2) When setting and updating the device states in xenstore for >>>> a[...] >>> The rest of your message is difficult to read on my screen due to wrap >>> damage. Can you wrap it to <75 characters ? >> Hmm it could do with some restructuring an rephrasing as well, sorry for that, >> hope it's a bit clearer now (combined note 2 and 3): >> >> 2) When setting and updating the device states in xenstore for >> a passed through device, libxl doesn't mimic the states that >> pciback and pcifront set and use in their "dance", very well. >> (see f.e. http://lists.xen.org/archives/html/xen-devel/2014-08/msg00970.html ) >> >> When a HVM guest has multiple pci devices passed through, >> pciback doesn't currently get a signal *what* pci-device has >> been removed from the guest on a 'xl pci-detach'. >> >> Pciback currently only has a watch on the whole "pci-root" >> of a guest in xenstore, so it only knows *something* has changed: >> - in or under the "pci-root" entry >> - In the PV guest case, the xenbus-state of a individual pci device >> node is set to a different state by pcifront, so pciback can take >> action on cleaning up / resetting this particular physical device. >> - In the HVM guest case however, the xenbus-states are don't mimic >> what pcifront does, so pciback doesn't do all the clean up and >> resetting on the physical device when doing a "xl pci-detach". >> It *does* cleanup when *all* devices and there for the complete" >> "pci-root" entry for a device is yanked out of xenstore. That's why >> you don't get intro trouble when you shutdown a guest with multiple >> devices passed through. You also don't get into trouble when a >> single device is passed through and removed (the "pci-root" entry in >> xenstore is removed when the last pci device of a guest is removed >> from the guest. >> But you *do* get into trouble when hot-unplugging >> any but the last device from a guest which has multiple pci devices >> passed through to it. You also won't notice it while it is still >> owned by pciback. You do notice it when you try to do a >> "xl pci-assignable-remove". > I don't know about detach but I apparently can't even properly attach a > second device --- I get complaints about it already being in xenstore. > But device does show up in the guest. > And then I can't remove it because, well, it's not in xenstore. > Maybe it has something to do with the fact that both devices are virtual > functions of the same physical device. I might look at this sometime > next week. Ah .. virtual function as in virtual functions of a SR-IOV NIC ? or the functions of a multi-function device (the later ends up as separate devices in a HVM guest (assuming you use qemu-xen and not qemu-traditional)). Ok wil keep an eye on the mailing list if i can test other help in any other way then ! Thx ! Sander > -boris >> >> I tried to fix this my self .. but ran into trouble because when you >> signal pciback via xenstore of the intent of removing a device from the >> guest, you need a callback but you can also run in to "timeout" issues. >> Another issue was that on shutdown you will remove multiple devices in >> short succession and i ran into locking/race issues, so it clearly >> became something beyond my skills at that point. >> >>> Thanks, >>> Ian. >> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 21:20 ` Sander Eikelenboom @ 2014-11-14 21:38 ` Boris Ostrovsky 2014-11-14 21:50 ` Sander Eikelenboom 0 siblings, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 21:38 UTC (permalink / raw) To: Sander Eikelenboom Cc: wei.liu2, stefano.stabellini, Ian Jackson, ian.campbell, xen-devel On 11/14/2014 04:20 PM, Sander Eikelenboom wrote: > Friday, November 14, 2014, 10:09:04 PM, you wrote: > > >> I don't know about detach but I apparently can't even properly attach a >> second device --- I get complaints about it already being in xenstore. >> But device does show up in the guest. >> And then I can't remove it because, well, it's not in xenstore. >> Maybe it has something to do with the fact that both devices are virtual >> functions of the same physical device. I might look at this sometime >> next week. > Ah .. virtual function as in virtual functions of a SR-IOV NIC ? > or the functions of a multi-function device (the later ends up as separate devices > in a HVM guest (assuming you use qemu-xen and not qemu-traditional)). Virtual functions of an SR-IOV NIC. I'd think they should show up as separate devices in xenstore. (And they do show up as separate devices in the guest) And qemu-xen. -boris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 21:38 ` Boris Ostrovsky @ 2014-11-14 21:50 ` Sander Eikelenboom 0 siblings, 0 replies; 21+ messages in thread From: Sander Eikelenboom @ 2014-11-14 21:50 UTC (permalink / raw) To: Boris Ostrovsky Cc: wei.liu2, stefano.stabellini, Ian Jackson, ian.campbell, xen-devel Friday, November 14, 2014, 10:38:11 PM, you wrote: > On 11/14/2014 04:20 PM, Sander Eikelenboom wrote: >> Friday, November 14, 2014, 10:09:04 PM, you wrote: >> >> >>> I don't know about detach but I apparently can't even properly attach a >>> second device --- I get complaints about it already being in xenstore. >>> But device does show up in the guest. >>> And then I can't remove it because, well, it's not in xenstore. >>> Maybe it has something to do with the fact that both devices are virtual >>> functions of the same physical device. I might look at this sometime >>> next week. >> Ah .. virtual function as in virtual functions of a SR-IOV NIC ? >> or the functions of a multi-function device (the later ends up as separate devices >> in a HVM guest (assuming you use qemu-xen and not qemu-traditional)). > Virtual functions of an SR-IOV NIC. I'd think they should show up as separate devices in xenstore. (And they do show up as separate devices in the guest) > And qemu-xen. Hmm i thought SR-IOV was in some aspects a special case, also thought i had seen some conversation between Jan and Konrad about SR-IOV and device removal the last week(s). But i don't have a SR-IOV capable NIC, so also no experiences related to that. > -boris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-10 23:16 ` [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down Boris Ostrovsky 2014-11-14 14:41 ` Ian Jackson @ 2014-11-14 16:19 ` Ian Jackson 2014-11-14 17:09 ` Boris Ostrovsky 1 sibling, 1 reply; 21+ messages in thread From: Ian Jackson @ 2014-11-14 16:19 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("[PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > When a device is hot-unplugged libxl sends QEMU a "device-del" message > (via QMP). This call returns after QEMU has initiated device removal by > sending an interrupt to the guest. At some point later QEMU is expected > to clean up after the device (such as unbind/unmap MSIs), which will > occur when the guest signals that the device has been ejected. As discussed, I agree that this is a problem. There is a race between qemu/libxl and the guest. If libxl gets there first, you see the symptoms you report. Having libxl not progress with the rest of the removal is the right approach to fixing it. However, unfortunately, the approach you have taken has some problems. The most seriouis is this: you may not call usleep() anywhere inside libxl. If you want to wait, you must use the callback mechanisms. This is because the process running libxl may be a daemon handling many domains, and we must not hang waiting for any particular domain. I think that this means: * Making libxl__device_pci_remove_common asynchronous (ie, so that it makes a callback when done, rather than simply returning; * Then, making do_pci_remove asynchronous, too. This will involve unfolding the loop in libxl__device_pci_remove_common into a callback chain. * Then, adjusting the new asynchronous do_pci_remove so that it becomes two (or perhaps more) chained callback functions which use a libxl__ev_time to manage the polling loop Secondly, I'm not particularly keen on polling. Is there not a QMP function that can be used to get notified when the device is really removed ? Sadly I guess that if there is, restructuring the qmp handling in libxl_qmp.c (qmp_next et al) to be able to use it would be way out of scope for a bugfix during the freeze. Finally, some notes about error handling: > + else { > + unsigned total_us = 0, wait_us = 100000; > + > + rc = -ETIMEDOUT; rc must contain only libxl error values. Most libxl functions return libxl error values, not errno values. I'm sorry that the rest of this file is so confused about this. I think you should use `ret' to contain responses which are `0 (for success) or -1 (setting errno)', and probably something like `errnoval' if you need actual errno values. You should definitely never write -EFOOBAR in userland code. Thanks, Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 16:19 ` Ian Jackson @ 2014-11-14 17:09 ` Boris Ostrovsky 2014-11-14 17:33 ` Ian Jackson 0 siblings, 1 reply; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-14 17:09 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini On 11/14/2014 11:19 AM, Ian Jackson wrote: > Boris Ostrovsky writes ("[PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): >> When a device is hot-unplugged libxl sends QEMU a "device-del" message >> (via QMP). This call returns after QEMU has initiated device removal by >> sending an interrupt to the guest. At some point later QEMU is expected >> to clean up after the device (such as unbind/unmap MSIs), which will >> occur when the guest signals that the device has been ejected. > As discussed, I agree that this is a problem. There is a race between > qemu/libxl and the guest. If libxl gets there first, you see the > symptoms you report. Having libxl not progress with the rest of the > removal is the right approach to fixing it. > > However, unfortunately, the approach you have taken has some problems. > > The most seriouis is this: you may not call usleep() anywhere inside > libxl. If you want to wait, you must use the callback mechanisms. There are already instances of usleep in libxl (e.g. qmp_open(), which is where I got the ~5 seconds number, btw). But I agree that callbacks are better here, I didn't think of non-xl users. > This is because the process running libxl may be a daemon handling > many domains, and we must not hang waiting for any particular domain. > > I think that this means: > * Making libxl__device_pci_remove_common asynchronous (ie, so > that it makes a callback when done, rather than simply returning; > * Then, making do_pci_remove asynchronous, too. This will involve > unfolding the loop in libxl__device_pci_remove_common into a > callback chain. > * Then, adjusting the new asynchronous do_pci_remove so that it > becomes two (or perhaps more) chained callback functions which > use a libxl__ev_time to manage the polling loop OK, I'll try to see what needs to be done (I am not familiar with how libxl does asynchronous calls) > > Secondly, I'm not particularly keen on polling. Is there not a QMP > function that can be used to get notified when the device is really > removed ? I didn't see any. > Sadly I guess that if there is, restructuring the qmp > handling in libxl_qmp.c (qmp_next et al) to be able to use it would be > way out of scope for a bugfix during the freeze. This would probably require changes in QEMU as well since I don't believe it notifies anyone on the other end of QMP connection that it is done cleaning up. > > > Finally, some notes about error handling: > >> + else { >> + unsigned total_us = 0, wait_us = 100000; >> + >> + rc = -ETIMEDOUT; > rc must contain only libxl error values. Most libxl functions return > libxl error values, not errno values. rc is overwritten later: if (rc) { rc = ERROR_FAIL; goto out_fail; } So this whole rc assignment business in the patch was somewhat pointless anyway. One question though: should we fail on the timeout and not clean up xenstore and reset the device (which is what this version of the patch does)? Because we may end up with device finally removed and cleanup up by QEMU/guest but not reset and removed from xenstore. Thanks. -boris > > I'm sorry that the rest of this file is so confused about this. I > think you should use `ret' to contain responses which are `0 (for > success) or -1 (setting errno)', and probably something like > `errnoval' if you need actual errno values. > > You should definitely never write -EFOOBAR in userland code. > > Thanks, > Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down 2014-11-14 17:09 ` Boris Ostrovsky @ 2014-11-14 17:33 ` Ian Jackson 0 siblings, 0 replies; 21+ messages in thread From: Ian Jackson @ 2014-11-14 17:33 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: xen-devel, wei.liu2, ian.campbell, stefano.stabellini Boris Ostrovsky writes ("Re: [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down"): > On 11/14/2014 11:19 AM, Ian Jackson wrote: > > The most seriouis is this: you may not call usleep() anywhere inside > > libxl. If you want to wait, you must use the callback mechanisms. > > There are already instances of usleep in libxl Yes. :-/. > (e.g. qmp_open(), which is where I got the ~5 seconds number, > btw). But I agree that callbacks are better here, I didn't think of > non-xl users. Right. In this case the polling loop is worse than in qmp_open, because a guest which doesn't respond to the pci event can trivially cause the process to lock up. (Causing qemu to fail to respond in a timely fashion is probably possible but harder.) > > I think that this means: > > * [asyncinfying] > > OK, I'll try to see what needs to be done (I am not familiar with how > libxl does asynchronous calls) Thanks. You probably want to read the CODING_STYLE patch series that I posted recently. There are some examples of how to do event chaining in the domain creation, bootloader execution and disk/vif handling. > > Secondly, I'm not particularly keen on polling. Is there not a QMP > > function that can be used to get notified when the device is really > > removed ? > > I didn't see any. Oh well. > > Finally, some notes about error handling: ... > > rc must contain only libxl error values. Most libxl functions return > > libxl error values, not errno values. > > rc is overwritten later: I didn't spot that. > So this whole rc assignment business in the patch was somewhat pointless > anyway. Ah well. > One question though: should we fail on the timeout and not clean up > xenstore and reset the device (which is what this version of the patch > does)? Because we may end up with device finally removed and cleanup up > by QEMU/guest but not reset and removed from xenstore. This is a bit of a policy decision. My initial view is that if the guest fails to respond to the unplug request, we should just force the unplug and the guest will have to live with the resulting damage. At the very least, it must be possible to tear the whole thing down properly when destroying the domain. Ian. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 2/2] libxl: Simplify cleanup in do_pci_remove() 2014-11-10 23:16 [PATCH 0/2] Two fixes for libxl's PCI detach operation Boris Ostrovsky 2014-11-10 23:16 ` [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down Boris Ostrovsky @ 2014-11-10 23:16 ` Boris Ostrovsky 1 sibling, 0 replies; 21+ messages in thread From: Boris Ostrovsky @ 2014-11-10 23:16 UTC (permalink / raw) To: ian.jackson, stefano.stabellini, ian.campbell, wei.liu2 Cc: boris.ostrovsky, xen-devel Calls to xc_physdev_unmap_pirq() will fail for HVM guests since QEMU makes the same call earlier, during device removal. In addition, guest's call to xc_domain_irq_permission() will also fail since clearing permissions is part of hypervisor's unmap_domain_pirq(). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- tools/libxl/libxl_pci.c | 32 +++++++++++++++++--------------- 1 files changed, 17 insertions(+), 15 deletions(-) diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c index 4c6a733..d5d7cc0 100644 --- a/tools/libxl/libxl_pci.c +++ b/tools/libxl/libxl_pci.c @@ -1310,24 +1310,26 @@ static int do_pci_remove(libxl__gc *gc, uint32_t domid, } fclose(f); skip1: - sysfs_path = libxl__sprintf(gc, SYSFS_PCI_DEV"/"PCI_BDF"/irq", pcidev->domain, - pcidev->bus, pcidev->dev, pcidev->func); - f = fopen(sysfs_path, "r"); - if (f == NULL) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Couldn't open %s", sysfs_path); - goto out; - } - if ((fscanf(f, "%u", &irq) == 1) && irq) { - rc = xc_physdev_unmap_pirq(ctx->xch, domid, irq); - if (rc < 0) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "xc_physdev_unmap_pirq irq=%d", irq); + if (!hvm) { + sysfs_path = libxl__sprintf(gc, SYSFS_PCI_DEV"/"PCI_BDF"/irq", + pcidev->domain, + pcidev->bus, pcidev->dev, + pcidev->func); + f = fopen(sysfs_path, "r"); + if (f == NULL) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "Couldn't open %s", sysfs_path); + goto out; } - rc = xc_domain_irq_permission(ctx->xch, domid, irq, 0); - if (rc < 0) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "xc_domain_irq_permission irq=%d", irq); + if ((fscanf(f, "%u", &irq) == 1) && irq) { + rc = xc_physdev_unmap_pirq(ctx->xch, domid, irq); + if (rc < 0) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "xc_physdev_unmap_pirq irq=%d", irq); + } } + fclose(f); } - fclose(f); } out: /* don't do multiple resets while some functions are still passed through */ -- 1.7.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
end of thread, other threads:[~2014-11-14 21:50 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-11-10 23:16 [PATCH 0/2] Two fixes for libxl's PCI detach operation Boris Ostrovsky 2014-11-10 23:16 ` [PATCH 1/2] libxl: Wait until QEMU removed the device before tearing it down Boris Ostrovsky 2014-11-14 14:41 ` Ian Jackson 2014-11-14 15:30 ` Boris Ostrovsky 2014-11-14 16:01 ` Ian Jackson 2014-11-14 16:24 ` Boris Ostrovsky 2014-11-14 16:31 ` Ian Jackson 2014-11-14 16:37 ` Boris Ostrovsky 2014-11-14 16:36 ` Ian Jackson 2014-11-14 16:58 ` Boris Ostrovsky 2014-11-14 17:45 ` Sander Eikelenboom 2014-11-14 18:07 ` Ian Jackson 2014-11-14 19:24 ` Sander Eikelenboom 2014-11-14 21:09 ` Boris Ostrovsky 2014-11-14 21:20 ` Sander Eikelenboom 2014-11-14 21:38 ` Boris Ostrovsky 2014-11-14 21:50 ` Sander Eikelenboom 2014-11-14 16:19 ` Ian Jackson 2014-11-14 17:09 ` Boris Ostrovsky 2014-11-14 17:33 ` Ian Jackson 2014-11-10 23:16 ` [PATCH 2/2] libxl: Simplify cleanup in do_pci_remove() Boris Ostrovsky
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.