From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 1 Sep 2016 17:11:48 -0600
From: Alex Williamson
Message-ID: <20160901171148.697e696b@t450s.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] qapi DEVICE_DELETED event issued *before* instance_finalize?!
To: qemu-devel
Cc: libvir-list@redhat.com

Hey,

I'm out of my QOM depth, so I'll just beg for help in advance. I noticed
in testing vfio-pci hotunplug that the host seems to be trying to reclaim
the device before QEMU is actually done with it. There's a very short
race where libvirt has seen the DEVICE_DELETED event and tries to unbind
the physical device from vfio-pci. The use count is clearly non-zero,
because the host driver tries to send a device request, but that event
channel has already been torn down. Nearly immediately after, QEMU
finally releases the device, but we can't do a proper reset due to some
issues with device references in the kernel.

When I run gdb on QEMU with breakpoints at
qapi_event_send_device_deleted() and vfio_instance_finalize(), the QAPI
event happens first. Clearly this is horribly wrong, right? I can't
unmap my references to the vfio device file until my instance_finalize
is called, so I'm always going to have that open when libvirt takes the
DEVICE_DELETED event as a cue to return the device to host drivers. The
call chains look like this:

#0  qapi_event_send_device_deleted (has_device=true, device=0x7f5ca3e36fb0 "hostdev0", path=0x7f5c89e84fe0 "/machine/peripheral/hostdev0", errp=0x7f5ca241f9e8 )
    at qapi-event.c:412
#1  0x00007f5ca1701608 in device_unparent (obj=0x7f5ca43ffc00)
    at hw/core/qdev.c:1115
#2  0x00007f5ca18b7891 in object_finalize_child_property (obj=0x7f5ca380f500, name=0x7f5ca3f21da0 "hostdev0", opaque=0x7f5ca43ffc00)
    at qom/object.c:1362
#3  0x00007f5ca18b56b2 in object_property_del_child (obj=0x7f5ca380f500, child=0x7f5ca43ffc00, errp=0x0)
    at qom/object.c:422
#4  0x00007f5ca18b5790 in object_unparent (obj=0x7f5ca43ffc00)
    at qom/object.c:441
#5  0x00007f5ca16c1f31 in acpi_pcihp_eject_slot (s=0x7f5ca4c41268, bsel=0, slots=4)
    at hw/acpi/pcihp.c:139

#0  vfio_instance_finalize (obj=0x7f5ca43ffc00)
    at /net/gimli/home/alwillia/Work/qemu.git/hw/vfio/pci.c:2731
#1  0x00007f5ca18b57c0 in object_deinit (obj=0x7f5ca43ffc00, type=0x7f5ca376f490)
    at qom/object.c:448
#2  0x00007f5ca18b5831 in object_finalize (data=0x7f5ca43ffc00)
    at qom/object.c:462
#3  0x00007f5ca18b6782 in object_unref (obj=0x7f5ca43ffc00)
    at qom/object.c:896
#4  0x00007f5ca1550cc0 in memory_region_unref (mr=0x7f5ca43fff00)
    at /net/gimli/home/alwillia/Work/qemu.git/memory.c:1476
#5  0x00007f5ca1553886 in do_address_space_destroy (as=0x7f5ca43ffe10)
    at /net/gimli/home/alwillia/Work/qemu.git/memory.c:2272

It appears that DEVICE_DELETED only means the VM is done with the
device, but libvirt is interpreting it as QEMU being done with the
device. Which is correct? Do we need a new event or do we need to fix
the ordering of this event?
An ordering fix would be more compatible with existing libvirt.

Thanks,
Alex
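
P.S. In case a toy model helps frame the question: the sketch below is
not QEMU code, just a minimal Python model of the ordering the two
backtraces suggest. It assumes that unparenting emits the event while
dropping only the parent's reference, and that some other holder (here a
stand-in for the memory region reference) keeps the object alive until
later. The names Device, ref, unref, unparent, finalize, run_unplug and
the log list are all made up for illustration.

```python
# Toy model of the ordering seen in gdb; not QEMU code.
# Every name here is hypothetical and exists only for illustration.

class Device:
    def __init__(self, name, log):
        self.name = name
        self.log = log        # shared list recording what happened, in order
        self.refcount = 1     # reference held by the parent
        self.parented = True

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.finalize()

    def unparent(self):
        # Models the first backtrace: the event goes out while the
        # parent's reference is dropped, whoever else still holds one.
        if not self.parented:
            return
        self.parented = False
        self.log.append("DEVICE_DELETED " + self.name)
        self.unref()

    def finalize(self):
        # Models the second backtrace: only at this point can host
        # resources (e.g. the open device file) be released.
        self.log.append("instance_finalize " + self.name)


def run_unplug():
    log = []
    dev = Device("hostdev0", log)
    dev.ref()        # stand-in for a reference held elsewhere (memory region)
    dev.unparent()   # guest eject: event emitted, object still alive
    dev.unref()      # later teardown drops the last reference
    return log


if __name__ == "__main__":
    for entry in run_unplug():
        print(entry)
```

In this model, an ordering fix would amount to recording the event from
finalize() rather than unparent(). Whether that is viable in QEMU is
exactly what I'm asking.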