From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4A0437E0.7080800@codemonkey.ws>
Date: Fri, 08 May 2009 08:47:12 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] New device API
References: <200905051231.09759.paul@codesourcery.com> <4A0390DC.2010908@redhat.com>
In-Reply-To: <4A0390DC.2010908@redhat.com>
List-Id: qemu-devel.nongnu.org
To: Zachary Amsden
Cc: Paul Brook, qemu-devel@nongnu.org

Zachary Amsden wrote:
> I think the general direction is good, and this is sorely needed, but I
> think having a fixed / static device struct isn't flexible enough to
> handle the complexities of multiple device / bus types - for example,
> MSI-X could add tons of IRQ vectors to a device, or complex devices
> could have tons of MMIO regions.
>
> I think a more flexible scheme would be to have a common device header,
> and fixed substructures which are added as needed to a device.
>
> For example, showing exploded embedded types / initialized properties.
> This isn't intended to be a realistic device, or even real C code.
>
> struct RocketController
> {
>     struct Device {
>         struct DeviceType *base_class;
>         const char *instance_name;
>         DeviceProperty *configuration_data;
>         struct DeviceExtension *next;
>     } dev;
>     struct PCIExtension {
>         struct DeviceExtension header {
>             int type = DEV_EXT_PCI;
>             struct DeviceExtension *next;
>         }
>         PCIRegs regs;
>         PCIFunction *registered_functions;
>         struct PCIExtension *pci_next;
>     } pci;
>     struct MMIOExtension {
>         struct DeviceExtension header {
>             int type = DEV_EXT_MMIO;
>             struct DeviceExtension *next;
>         }
>         target_phys_addr_t addr;
>         target_phys_addr_t size;
>         mmio_mapfunc cb;
>     } mmio;
>     struct IRQExtension {
>         struct DeviceExtension header {
>             int type = DEV_EXT_IRQ;
>             struct DeviceExtension *next;
>         }
>         int irq;
>         int type = DEV_INTR_LEVEL | DEV_INTR_MSI;
>     } irqs[ROCKET_CONTROLLER_MSIX_IRQS + 1];

I think the problem with this is that it gives too much information
about CPU constructs (MMIO/IRQs) to a device that is never connected to
the actual CPU.

PCI devices do not have a concept of "MMIO". They have a concept of IO
regions. You can have different types of IO regions, and the word sizes
that are supported depend on the width of the PCI bus. Additionally,
the rules about endianness of the data depend entirely on the PCI
controller, not the CPU itself.

This is where the current API fails miserably. You cannot have a PCI
device calling the CPU MMIO registration functions directly because
there is no sane way to deal with endianness conversion. Instead, the
IO region registration has to go through the PCI bus.

Likewise, for IRQs, we should stick to the same principle. PCI exposes
MSI and LNK interrupts to devices. We should have an API for devices to
consume at the PCI level for that.
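To make that concrete, here's roughly the shape of the API I'd want a
device to see. This is only a sketch; every name in it is invented for
illustration and nothing here exists today:

#include <stdint.h>

typedef struct PCIDevice PCIDevice;

/* The device describes its IO regions in PCI terms only: a BAR number,
 * a size, and a type.  Where the region lands in guest physical memory
 * and how accesses are byte-swapped is decided by the PCI bus / host
 * controller, never by the device. */

typedef uint32_t (*PCIRegionReadFunc)(void *opaque, uint32_t addr,
                                      int size);
typedef void (*PCIRegionWriteFunc)(void *opaque, uint32_t addr,
                                   uint32_t val, int size);

enum PCIRegionType {
    PCI_REGION_IO,      /* IO space BAR */
    PCI_REGION_MEM,     /* memory space BAR */
};

void pci_bus_register_region(PCIDevice *dev, int bar, uint32_t size,
                             enum PCIRegionType type,
                             PCIRegionReadFunc read,
                             PCIRegionWriteFunc write,
                             void *opaque);

/* Interrupts are exposed in PCI terms too: the device raises a LNK pin
 * or sends an MSI message; it never sees which CPU IRQ that routes to. */
void pci_bus_set_lnk(PCIDevice *dev, int lnk_pin, int level);
void pci_bus_send_msi(PCIDevice *dev, int vector);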
As Paul said, I don't think it's worth trying to make the same devices
work on top of multiple busses. I think it's asking for trouble.
Instead, devices that can connect to multiple busses (like ne2k) can
have separate register_pci_device() and register_isa_device() calls and
then just internally abstract their chipset functions. Then it only
takes a small bit of ISA and PCI glue code to support both device
types.

> 1) Static per-device configuration choices, such as feature enable /
> disable, framebuffer size, PCI vs. ISA bus representation.  These are
> fixed choices over the lifetime of the VM.  In the above, they would
> be represented as DeviceProperty configuration strings.

Yes, I think this is important too. But when we introduce this, we need
to make sure the devices pre-register which strings they support and
provide human-consumable descriptions of what those knobs do.
Basically, we should be able to auto-extract a hardware documentation
file from the device that describes all of the supported knobs in
detail.

> 2) Dynamic per-device data.  Things such as PCI bus numbers, IRQs or
> MMIO addresses.  These are dynamic and change, but have fixed, defined
> layouts.  They may have to be stored in an intermediate representation
> (a configuration file or wire protocol) for suspend / resume or
> migration.  In the above, it would be possible to write handlers for
> suspend / resume that operate on a generic device representation,
> knowing to call specific handlers for PCI, etc.  Would certainly
> mitigate a lot of bug potential in the dangerous intersection of
> ad-hoc device growth and complex migration / suspend code.

For the most part, I think the device should be unaware of these
things. It never needs to see its devfn. It should preregister which
lnks it supports and whether it supports MSI, but it should never know
which IRQs those actually get routed to.

> 3) Implementation specific data.  Pieces of data which are required
> for a particular realization of a device on the host, such as file
> descriptors, call back functions, authentication data, network
> routing.  This data is not visible at this layer, nor do I think it
> should be.  Since devices may move, and virtual machines may migrate,
> possibly across widely differing platforms, there could be an entirely
> different realization of a device (OSS vs. ALSA sound as a trivial
> example).  You may even want to dynamically change these on a running
> VM (SDL to VNC is a good example).
>
> How to represent #3 is not entirely clear, nor is it clear how to parse

We really have three types of things, and it's not entirely clear to me
how to name them. What we're currently calling devices are emulated
hardware. Additionally, we have host drivers that provide backend
functionality; this would include things like SDL, VNC, and the tap
VLAN driver. We also need something like a host device: the front-end
functionality that connects to the host driver backend.

A device registers its creation function in its module_init(). This
creation function will then register the fact that it's a PCI device
and will register basic information about itself in that registration.
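Roughly, I'd expect that flow to look something like this. Again only a
sketch with invented names (the module_init() here is a stand-in macro,
and e1000 is just an example device):

#include <stdint.h>

typedef struct PCIDevice PCIDevice;

/* Invented registration API, for illustration only. */
typedef struct PCIDeviceInfo {
    const char *name;               /* "e1000" */
    uint16_t vendor_id;             /* example IDs only */
    uint16_t device_id;
    int (*init)(PCIDevice *dev);    /* runs when an instance is created */
} PCIDeviceInfo;

void pci_device_register(const PCIDeviceInfo *info);

/* Stand-in: assume this arranges for fn() to run at startup. */
#define module_init(fn) void qemu_module_init_##fn(void) { fn(); }

static int e1000_init(PCIDevice *dev)
{
    (void)dev;
    /* here the device would register BARs, lnk pins, MSI support,
     * etc. with the PCI bus, via a bus-level API like the earlier
     * sketch */
    return 0;
}

static PCIDeviceInfo e1000_info = {
    .name      = "e1000",
    .vendor_id = 0x8086,
    .device_id = 0x100e,
    .init      = e1000_init,
};

static void e1000_register(void)
{
    /* "I am a PCI device" plus basic information about myself */
    pci_device_register(&e1000_info);
}

module_init(e1000_register)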
A PCI device can be instantiated via a device configuration file, and
when that happens, the device will create a host device for whatever
functionality it supports. For a NIC, this would be a NetworkHostDevice
or something like that. This NetworkHostDevice would have some link to
the device itself (via an id, handle, whatever). A user can then create
a NetworkHostDriver and attach the NetworkHostDevice to that driver,
and you then have a functional emulated NIC.

Regards,

Anthony Liguori
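P.S. A quick sketch of that NIC wiring, with invented names throughout:

#include <stddef.h>
#include <stdint.h>

/* Front end: created by the emulated NIC when it is instantiated. */
typedef struct NetworkHostDevice {
    const char *id;                       /* link back to the NIC */
    void (*receive)(struct NetworkHostDevice *dev,
                    const uint8_t *buf, size_t len);
} NetworkHostDevice;

/* Back end: e.g. the tap VLAN driver; SDL or VNC would be the
 * analogous backends on the display side. */
typedef struct NetworkHostDriver {
    NetworkHostDevice *peer;
} NetworkHostDriver;

/* Attaching the host device to a host driver completes the path:
 * frames from the backend get pushed into peer->receive(), and the
 * NIC transmits by calling back into its driver. */
void network_host_driver_attach(NetworkHostDriver *drv,
                                NetworkHostDevice *dev)
{
    drv->peer = dev;
}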