From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anthony Liguori <anthony@codemonkey.ws>
Subject: Re: [PATCH 5/7] KVM/userspace: Device Assignment: Support for assigning
 PCI devices to guests
Date: Wed, 24 Sep 2008 10:07:24 -0500
Message-ID: <48DA57AC.1020903@codemonkey.ws>
References: <1222181695-23418-1-git-send-email-amit.shah@redhat.com> <1222181695-23418-6-git-send-email-amit.shah@redhat.com> <48D919A8.8040409@codemonkey.ws> <200809241028.09067.amit.shah@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: avi@redhat.com, kvm@vger.kernel.org, muli@il.ibm.com,
	benami@il.ibm.com, weidong.han@intel.com, allen.m.kay@intel.com
To: Amit Shah <amit.shah@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from ug-out-1314.google.com ([66.249.92.169]:7844 "EHLO
	ug-out-1314.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751681AbYIXPIa (ORCPT <rfc822;kvm@vger.kernel.org>);
	Wed, 24 Sep 2008 11:08:30 -0400
Received: by ug-out-1314.google.com with SMTP id k3so1866055ugf.37
        for <kvm@vger.kernel.org>; Wed, 24 Sep 2008 08:08:28 -0700 (PDT)
In-Reply-To: <200809241028.09067.amit.shah@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Amit Shah wrote:
> * On Tuesday 23 Sep 2008 22:00:32 Anthony Liguori wrote:
>   
>> Amit Shah wrote:
>>     
>
>   
>>> diff --git a/qemu/Makefile.target b/qemu/Makefile.target
>>> index 72f3db8..40eb273 100644
>>> --- a/qemu/Makefile.target
>>> +++ b/qemu/Makefile.target
>>> @@ -616,6 +616,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
>>>  OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
>>>  OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
>>>  OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
>>> +OBJS+= device-assignment.o
>>>       
>> This needs to be conditional on at least linux hosts, but probably also
>> kvm support.
>>     
>
> I didn't see any other file that's doing it. So I added this conditional in 
> vl.c by having a #if defined(__linux__). That's how usb-linux.c does it as 
> well. Is there a better way?
>   

aio and compatfd currently do it this way.  block-raw-win32 and 
block-raw-posix are this way.  We're slowly moving things away from 
#ifdef #else #endif to conditional compilation.

> Not the whole functionality needs kvm support. This should be able to work 
> even without kvm (for example, when the guest is 1:1 mapped in the host 
> address space).
>   

KVM is needed for interrupt remapping though.  That's something I don't 
see happening for normal userspace any time soon.

>>> +	/* FIXME: Add support for emulated MMIO for non-kvm guests */
>>> +	if (kvm_enabled()) {
>>>       
>> This doesn't work at all if kvm isn't enabled right?  You should
>> probably bail out in the init if kvm isn't enabled.  If this whole file
>> is included conditionally based on KVM support, then you don't have to
>> worry about using kvm_enabled() guards to conditionally compile out code.
>>     
>
> Non-kvm support is currently broken and should be fixed, but that can happen 
> after we get this merged.
>   

But it would take bouncing interrupts to userspace?  I don't think that 
will ever happen upstream personally.  At any rate, there's no point in 
even trying to support something like that until progress is made 
upstream on this front.

> I can temporarily add a check for kvm_enabled and bail out.
>
>   
>>> +	sprintf(dir, "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
>>> +		r_bus, r_dev, r_func);
>>>       
>> snprintf()
>>     
>
> It's guarded by the %02x modifiers; so this doesn't depend on user input.
>   

strcpy or sprintf should never be used.  It doesn't matter if it's safe 
in a particular instance.  There are safer functions to use (like snprintf).

All it takes is for someone to come along and change the /sys/bus path 
to be larger without adjusting the buffer size and everything goes to 
hell.  It's inherently brittle.

>>> +	fprintf(stderr, "Registered host PCI device %02x:%02x.%1x "
>>> +		"(\"%s\") as guest device %02x:%02x.%1x\n",
>>> +		r_bus, r_dev, r_func, e_dev_name,
>>> +		pci_bus_num(e_bus), e_device, r_func);
>>>       
>> Please don't fprintf() unconditionally.
>>     
>
> OK; however, a vmdk file open does that so I though it was alright to do it.
>   

I obviously don't use vmdk or else I would have removed that by now :-)

>> A lot more checks are needed here to see if things can succeed.  We
>> definitely should bail out if they can't.
>>     
>
> Bailing out is done in the out: label below. What else do  you think can fail? 
> I've taken care of all the cases that do fail IMO.
>
>   
>>> +	return pci_dev;
>>> +out:
>>> +	pci_unregister_device(&pci_dev->dev);
>>> +	return NULL;
>>> +}
>>>       
>
>   
>>> +/*
>>> + * Syntax to assign device:
>>> + *
>>> + * -pcidevice dev=bus:dev.func,dma=dma
>>> + *
>>> + * Example:
>>> + * -pcidevice host=00:13.0,dma=pvdma
>>> + *
>>> + * dma can currently only be 'none' to disable iommu support.
>>>       
>> Does it actually work if you disable iommu support?
>>     
>
> If the guest is 1:1 mapped.
>   

You mean with Andrea's reserved ram patches?

>>> +#include <sys/mman.h>
>>>       
>> Don't think this is needed here.
>>     
>
> We use mmap(), so this is needed.
>   

Ah.

>>> +    /* Initialize assigned devices */
>>> +    if (pci_enabled) {
>>> +        int r = -1;
>>> +        do {
>>> +            init_assigned_device(pci_bus, &r);
>>>       
>> Why pass r by reference instead of just returning it?  At any rate, you
>> should detect when this fails and gracefully terminate QEMU.
>>     
>
> 'r' is the count of the number of assigned devices -- mostly needed because we 
> have the data stored in an array. If we migrate to a list, this can be 
> relaxed.
>
> ATM, I start the guest without assigning the device. I haven't figured out a 
> way to gracefully terminate qemu yet.
>   

In the case of hot plug, you fail the hot plug.  If you start with 
device assignment, just doing an "exit" would be sufficient.

>>> +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__)
>>> +	    case QEMU_OPTION_pcidevice:
>>> +		add_assigned_device(optarg);
>>>       
>> You should copy into an array, then in pc.c, iterate through the array
>> and call into add_assigned_device.
>>     
>
> Is there any benefit in doing this? We're moving the iterate out of vl.c to 
> pc.c and both will happen at the same time.
>   

It's how everything else works.

Regards,

Anthony Liguori

>> Regards,
>>
>> Anthony Liguori
>>     
>
> Thanks!
>
> Amit.
>