From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38318) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Z2Ak4-0005JS-2f for qemu-devel@nongnu.org; Tue, 09 Jun 2015 00:02:25 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Z2Ak0-0000Yz-Nr for qemu-devel@nongnu.org; Tue, 09 Jun 2015 00:02:24 -0400 Received: from [59.151.112.132] (port=46832 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Z2AUi-0000VB-UF for qemu-devel@nongnu.org; Mon, 08 Jun 2015 23:46:35 -0400 Message-ID: <557660D3.80909@cn.fujitsu.com> Date: Tue, 9 Jun 2015 11:43:15 +0800 From: Chen Fan MIME-Version: 1.0 References: <1432762344.24271.93.camel@redhat.com> <556D6132.2080400@cn.fujitsu.com> <1433263649.5794.223.camel@redhat.com> <556E4FE3.4000607@cn.fujitsu.com> <1433433560.3510.175.camel@redhat.com> In-Reply-To: <1433433560.3510.175.camel@redhat.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: izumi.taku@jp.fujitsu.com, qemu-devel@nongnu.org On 06/04/2015 11:59 PM, Alex Williamson wrote: > On Wed, 2015-06-03 at 08:52 +0800, Chen Fan wrote: >> On 06/03/2015 12:47 AM, Alex Williamson wrote: >>> On Tue, 2015-06-02 at 15:54 +0800, Chen Fan wrote: >>>> On 05/28/2015 05:32 AM, Alex Williamson wrote: >>>>> On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote: >>>>>> we introduce a has_bus_reset capability to sign the vfio >>>>>> devices if support host bus reset. >>>>>> >>>>>> Signed-off-by: Chen Fan >>>>>> --- >>>>>> hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> 1 file changed, 123 insertions(+) >>>>>> >>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>> index f4e7855..5934fd7 100644 >>>>>> --- a/hw/vfio/pci.c >>>>>> +++ b/hw/vfio/pci.c >>>>>> @@ -33,6 +33,7 @@ >>>>>> #include "hw/pci/msix.h" >>>>>> #include "hw/pci/pci.h" >>>>>> #include "hw/pci/pci_bridge.h" >>>>>> +#include "hw/pci/pci_bus.h" >>>>>> #include "qemu-common.h" >>>>>> #include "qemu/error-report.h" >>>>>> #include "qemu/event_notifier.h" >>>>>> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice { >>>>>> bool req_enabled; >>>>>> bool has_flr; >>>>>> bool has_pm_reset; >>>>>> + bool has_bus_reset; >>>>> I still think that caching this is a bad idea, there's no point at which >>>>> we can blindly assume the capability is still present. >>>>> >>>>>> bool rom_read_failed; >>>>>> } VFIOPCIDevice; >>>>>> >>>>>> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); >>>>>> static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, >>>>>> uint32_t val, int len); >>>>>> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled); >>>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev); >>>>>> >>>>>> /* >>>>>> * Disabling BAR mmaping can be slow, but toggling it around INTx can >>>>>> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size) >>>>>> dev_iter = pci_bridge_get_device(dev_iter->bus); >>>>>> } >>>>>> >>>>>> + /* >>>>>> + * Don't check bus reset capability when device is enabled during >>>>>> + * qemu machine creation, which is done by machine init function. >>>>>> + */ >>>>>> + if (DEVICE(vdev)->hotplugged) { >>>>>> + vfio_check_host_bus_reset(vdev); >>>>>> + if (!vdev->has_bus_reset) { >>>>>> + error_report("vfio: Cannot enable AER for device %s, " >>>>>> + "which is not support host bus reset.", >>>>> "which does not support host bus reset." >>>>> >>>>>> + vdev->vbasedev.name); >>>>>> + goto error; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4); >>>>>> /* >>>>>> * The ability to record multiple headers is depending on >>>>>> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev) >>>>>> } >>>>>> } >>>>>> >>>>>> +struct VfioDeviceFind { >>>>> We use VFIOFooBar for all other camel case definitions, much like PCIBus >>>>> and PCIDevice below. >>>>> >>>>>> + PCIBus *pbus; >>>>>> + PCIDevice *pdev; >>>>>> + bool found; >>>>>> +}; >>>>>> + >>>>>> +static void find_devices(PCIBus *bus, void *opaque) >>>>>> +{ >>>>>> + struct VfioDeviceFind *find = opaque; >>>>>> + int i; >>>>>> + >>>>>> + if (find->found == true) { >>>>> if (find->found) {... >>>>> >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + for (i = 0; i < ARRAY_SIZE(bus->devices); i++) { >>>>>> + if (!bus->devices[i]) { >>>>>> + continue; >>>>>> + } >>>>>> + >>>>>> + if (bus->devices[i] == find->pdev) { >>>>>> + find->pbus = bus; >>>>>> + find->found = true; >>>>>> + break; >>>>>> + } >>>>>> + } >>>>>> +} >>>>>> + >>>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev) >>>>>> +{ >>>>>> + PCIBus *bus = vdev->pdev.bus; >>>>>> + struct vfio_pci_hot_reset_info *info = NULL; >>>>>> + struct vfio_pci_dependent_device *devices; >>>>>> + VFIOGroup *group; >>>>>> + int ret, i; >>>>>> + bool has_bus_reset = false; >>>>>> + >>>>>> + ret = vfio_get_hot_reset_info(vdev, &info); >>>>>> + if (ret < 0) { >>>>> if (ret) {... >>>>> >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + /* List all affected devices by bus reset */ >>>>>> + devices = &info->devices[0]; >>>>>> + >>>>>> + /* Verify that we have all the groups required */ >>>>>> + for (i = 0; i < info->count; i++) { >>>>>> + PCIHostDeviceAddress host; >>>>>> + VFIOPCIDevice *tmp; >>>>>> + VFIODevice *vbasedev_iter; >>>>>> + >>>>>> + host.domain = devices[i].segment; >>>>>> + host.bus = devices[i].bus; >>>>>> + host.slot = PCI_SLOT(devices[i].devfn); >>>>>> + host.function = PCI_FUNC(devices[i].devfn); >>>>>> + >>>>>> + /* Skip the current device */ >>>>>> + if (vfio_pci_host_match(&host, &vdev->host)) { >>>>>> + continue; >>>>>> + } >>>>>> + >>>>>> + /* Ensure we own the group of the affected device */ >>>>>> + QLIST_FOREACH(group, &vfio_group_list, next) { >>>>>> + if (group->groupid == devices[i].group_id) { >>>>>> + break; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + if (!group) { >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + /* Ensure affected devices for reset under the same bus */ >>>>>> + QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { >>>>>> + if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) { >>>>>> + continue; >>>>>> + } >>>>>> + tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev); >>>>>> + if (vfio_pci_host_match(&host, &tmp->host)) { >>>>>> + struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false }; >>>>>> + >>>>>> + pci_for_each_bus(bus, find_devices, &find); >>>>>> + if (!find.found) { >>>>>> + goto out; >>>>>> + } >>>>>> + /* >>>>>> + * When the check device is hotplugged to a higher bus again, >>>>>> + * which would influence the affected device which enable aer >>>>>> + * below the bus. >>>>>> + */ >>>>>> + if (tmp->features & VFIO_FEATURE_ENABLE_AER && >>>>>> + find.pbus != bus) { >>>>>> + goto out; >>>>>> + } >>>>> I think what you're trying to do here is assume that if a reset of A >>>>> affects B, then a reset of B affects A, and if B is on a subordinate bus >>>>> from A, then our configuration is broken, right? Can we really >>>>> guarantee that assumption? If we had a physical topology that mirrored >>>>> this virtual topology, that wouldn't necessarily be true. For instance, >>>>> if A was a function of a multi-function device where another function >>>>> was a PCIe upstream switch, B could be subordinate to that switch, so a >>>>> reset of A affects B, but a reset of B doesn't affect A. >>>>> >>>>>> + break; >>>>>> + } >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + has_bus_reset = true; >>>>>> + >>>>>> +out: >>>>>> + vdev->has_bus_reset = has_bus_reset; >>>>> I don't see any value is storing this, it can't be trusted at any point >>>>> in the future. I think that any time we add a device or think about >>>>> forwarding an AER message to the guest, we need to do a validation pass, >>>>> testing the entire topology. >>>>> >>>>> To do that we'd iterate through every device in every group for PCI >>>>> devices that support AER. For each one, get the hot reset info for the >>>>> affected device list. For each affected device, if it's attached to the >>>>> VM, it needs to be on or below the bus of the target device. >>>>> Additionally, we should test all of the non-AER supporting vfio-pci >>>>> devices on or below the target bus to verify they have a reset >>>>> mechanism. Run that function for each vfio-pci device as it's added, >>>>> regardless of whether it's hotplug. If the test fails, fail the initfn >>>>> for the current device. Also run the test prior to AER injection, if it >>>>> fails demote the AER injection to a machine halt as we have now. >>>> I'm worry about is the case that the affected devices belonged to >>>> another groups but when initialize this device the another group >>>> has not been added. it would cause fail the initfn, but the group >>>> maybe added later. so I used the machine done event notifier to >>>> check this case. if we don't do like that. how can we check the >>>> case when all vfio-pci devices initfn ? >>> That's why the initfn test needs to test the entire topology, not just >>> the device being added. If we have the case you describe where we have >>> two devices in separate groups, we add the first device, test the >>> topology, see that the second device is affected but not yet added and >>> allow AER to be enabled. When the second device is added, we again test >>> the topology, we see the potential conflict and fail the initfn for the >>> second device if the bus requirements are not met. >> In case that the second device may not be added. in this case the >> first device enable the aer, and pass the validate. how do we know >> the second device if be added ? > Yeah, I see what you're getting at here. If we have a dual port NIC > with isolation between functions such that each is a separate IOMMU > group, when we add the first device with AER enabled it will fail > because a bus reset affects both groups and we don't yet own the second > group. > > My proposal wouldn't provide a way to ever enable AER for these devices. > However, the proposal of using a machine-init-done hook only allows > cold-plug of such devices with AER, hotplug would get the same issue. I > don't think that sort of asymmetry is supportable either. > > We almost need some sort of vfio-group stub driver in qemu that we can > claim ownership of a group without adding any of the devices in the > group to the VM. Another option might be to make AER a "soft" > requirement, demote it to a VM stop unless the topology allows it. That > also creates a confusing user scenario that probably looks nearly random > from an outside perspective. The only other idea I can see would be to > allow some manipulation of iommu groups at the kernel level, perhaps > creating a meta-group composed of one or more iommu groups. Or maybe a > kernel option that could ignore isolation for specified devices to > broaden the group. Are we really sure exposing AER to guests is a good > idea? Requiring guest bus resets to map to host bus rests is really a > mess. Thanks, > > Alex Usually the user and the administer are not the same one, when device eject aer error, the user isn't aware of, so I think this feature should have. I think own the affected group in VM for aer devices is good idea. and I sent out the new version 9 to fix the init issues. pls help to review it. Thanks, Chen > > . >