From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1M2FLV-0001wu-SA
	for qemu-devel@nongnu.org; Thu, 07 May 2009 21:57:53 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1M2FLQ-0001p6-8J
	for qemu-devel@nongnu.org; Thu, 07 May 2009 21:57:52 -0400
Received: from [199.232.76.173] (port=60674 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1M2FLQ-0001p3-2O
	for qemu-devel@nongnu.org; Thu, 07 May 2009 21:57:48 -0400
Received: from mx2.redhat.com ([66.187.237.31]:38135)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <zamsden@redhat.com>) id 1M2FLP-0006wW-DH
	for qemu-devel@nongnu.org; Thu, 07 May 2009 21:57:47 -0400
Message-ID: <4A0390DC.2010908@redhat.com>
Date: Thu, 07 May 2009 15:54:36 -1000
From: Zachary Amsden <zamsden@redhat.com>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] New device API
References: <200905051231.09759.paul@codesourcery.com>
In-Reply-To: <200905051231.09759.paul@codesourcery.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Paul Brook <paul@codesourcery.com>
Cc: qemu-devel@nongnu.org

Paul Brook wrote:
> The attached patch is my attempt at a new internal API for device creation in 
> qemu.
> 
> The long term goal here is fully dynamic machine construction, i.e. a user 
> supplied configuration file/tree rather than the current hardcoded board 
> configs.
> 
> As a first step towards this, I've implemented an API that allows devices to 
> be created and configured without requiring device specific knowledge.
> 
> There are two sides to this API.  The device side allows a device to 
> register/configure functionality (e.g. MMIO regions, IRQs, character devices, 
> etc). Most of the infrastructure for this is already present, we just need to 
> configure it consistently, rather than on an ad-hoc basis. I expect this API 
> to remain largely unchanged[1].
> 
> The board side allows creation on configuration of devices. In the medium term 
> I expect this to go away, and be replaced by data driven configuration.
> 
> I also expect that this device abstraction will let itself to a system bus 
> model to solve many of the IOMMU/DMA/address mapping issues that qemu 
> currently can't handle.
> 
> There are still a few rough edges. Busses aren't handled in a particularly 
> consistent way, PCI isn't particularly well integrated, and the 
> implementation of things like qdev_get_chardev is an ugly hack mapping onto 
> current commandline options. However I believe the device side API is fairly 
> solid, and is a necessary prerequisite for fixing the bigger configuration 
> problem.


I think the general direction is good, and this is sorely needed, but I
think having a fixed / static device struct isn't flexible enough to
handle the complexities of multiple device / bus types - for example,
MSI-X could add tons of IRQ vectors to a device, or complex devices
could have tons of MMIO regions.

I think a more flexible scheme would be to have a common device header,
and fixed substructures which are added as needed to a device.

For example, showing exploded embedded types / initialized properties.
This isn't intended to be a realistic device, or even real C code.

struct RocketController
{
    struct Device {
        struct DeviceType *base_class;
	const char *instance_name;
        DeviceProperty *configuration_data;
        struct DeviceExtension *next;
    } dev;
    struct PCIExtension {
        struct DeviceExtension header {
            int type = DEV_EXT_PCI;
            struct DeviceExtension *next;
        }
        PCIRegs regs;
        PCIFunction *registered_functions;
        struct PCIExtension *pci_next;
    } pci;
    struct MMIOExtension {
        struct DeviceExtension header {
            int type = DEV_EXT_MMIO;
            struct DeviceExtension *next;
        }
        target_phys_addr_t addr;
        target_phys_addr_t size;
        mmio_mapfunc cb;
    } mmio;
    struct IRQExtension {
        struct DeviceExtension header {
            int type = DEV_EXT_IRQ;
            struct DeviceExtension *next;
        }
        int irq;
        int type = DEV_INTR_LEVEL | DEV_INTR_MSI;
    } irqs [ROCKET_CONTROLLER_MSIX_IRQS+1];
    struct MSIXExtension {
        struct DeviceExtension header {
            int type = DEV_EXT_MSIX;
            struct DeviceExtension *next;
        }
        struct MMIOExtension *table_bar;
        target_phys_addr_t table_offset;
        struct MMIOExtension *pba_bar;
        target_phys_addr_t pba_offset;
    } msix;
    struct RocketFuelControllerStatus {
        struct DeviceExtension header {
            int type = DEV_EXT_PRIVATE;
            struct DeviceExtension *next;
            suspend_callback_fn *foo;
            resume_callback_fn *goo;
        }
        ...
    } private;
}


Now it's possible to define cool macros to walk an arbitrary device
structure (as the extensions are linked by the next field):

#define dev_for_each_ext(dev,ext) \
 for (ext = dev->next; ext; ext = ext->next)

#define dev_for_each_irq(dev,ext) \
  dev_for_each_ext(dev,ext) if (ext->type == DEV_EXT_IRQ)

inline Bool dev_supports_pci(struct Device *dev)
{
   struct DeviceExtension *ext;
   dev_for_each_ext(dev,ext)
       if (ext->type == DEV_EXT_PCI)
          return TRUE;
   return FALSE;
}

You could also use offsets instead of a next pointer (might allow more
space efficient packing of a backpointer to the device struct proper).

Device configuration would then be something like

qdev_add_pci(dev, &rocket->pci);
for (i = 0; i < NUM_ROCKET_IRQS; i++)
   qdev_add_intr(dev, &rocket->intr[i]);
qdev_add_msix(dev, &rocket->msix);
qdev_add_private_data(dev, &rocket->private, rocket_launch, rocket_land);

And of course you can skip regions that aren't configured (MSI-X, might
for example have a configuration variable to disable).

The config issue is certainly a very important problem to address.  I
think it should be noted there are three types of configuration data,
all of which are ontologically different.  There are:

1) Static per-device configuration choices, such as feature enable /
disable, framebuffer size, PCI vs. ISA bus representation.  These are
fixed choices over the lifetime on the VM.  In the above, they would be
represented as DeviceProperty configuration strings.

2) Dynamic per-device data.  Things such as PCI bus numbers, IRQs or
MMIO addresses.  These are dynamic and change, but have fixed, defined
layouts.  They may have to be stored in an intermediate representation
(a configuration file or wire protocol) for suspend / resume or
migration.  In the above, it would be possible to write handlers for
suspend / resume that operate on a generic device representation,
knowing to call specific handlers for PCI, etc.  Would certainly
mitigate a lot of bug potential in the dangerous intersection of ad-hoc
device growth and complex migration / suspend code.

3) Implementation specific data.  Pieces of data which are required for
a particular realization of a device on the host, such as file
descriptors, call back functions, authentication data, network routing.
 This data is not visible at this layer, nor do I think it should be.
Since devices may move, and virtual machines may migrate, possibly
across widely differing platforms, there could be an entirely different
realization of a device (OSS vs. ALSA sound as a trivial example).  You
may even want to dynamically change these on a running VM (SDL to VNC is
a good example).

How to represent #3 is not entirely clear, nor is it clear how to parse
the arguments and data required, as it is ad-hoc and necessarily
implementation dependent.  The dynamic changing could be done via
monitor commands or something like it.  It should be entirely possible
to switch my SVGA backend from SDL local rendering to using VNC at
run-time, for example, as long as I supply any additional configuration
data needed for VNC.

I think any attempt to broach the config problem has to realize these
distinctions in whatever namespace it uses if it is to be successful.

How to represent this of course will require more discussion.

Zach