From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=53458 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ptkcg-0001Q2-AQ
	for qemu-devel@nongnu.org; Sun, 27 Feb 2011 12:41:35 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Ptkce-0002Zn-L0
	for qemu-devel@nongnu.org; Sun, 27 Feb 2011 12:41:34 -0500
Received: from mail-yi0-f45.google.com ([209.85.218.45]:51496)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Ptkce-0002Zi-HN
	for qemu-devel@nongnu.org; Sun, 27 Feb 2011 12:41:32 -0500
Received: by yib19 with SMTP id 19so460038yib.4
	for <qemu-devel@nongnu.org>; Sun, 27 Feb 2011 09:41:31 -0800 (PST)
Message-ID: <4D6A8CC9.4090304@codemonkey.ws>
Date: Sun, 27 Feb 2011 11:41:29 -0600
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
References: <20110222170004.808373778@redhat.com>	<20110222170115.710717278@redhat.com>	<4D642181.4080509@codemonkey.ws>	<20110222210735.GA9372@amt.cnet>	<4D64266A.3060106@codemonkey.ws>	<20110222230935.GA11082@amt.cnet>	<4D644343.4050800@codemonkey.ws>	<4D65051A.6070707@redhat.com>	<4D651B20.70405@codemonkey.ws>	<4D652852.60505@redhat.com>	<4D652F73.3000305@codemonkey.ws>	<4D65324A.5080408@redhat.com>	<4D65359E.3040008@codemonkey.ws>	<4D65416D.8040803@redhat.com>	<4D656B97.5030301@codemonkey.ws>	<4D661CB8.6010305@redhat.com>	<4D667287.9010005@codemonkey.ws>	<4D6677BE.2030009@redhat.com>	<4D669C46.40909@codemonkey.ws>	<4D6A150B.8030205@redhat.com>
	<4D6A58E0.9020607@codemonkey.ws> <4D6A6E38.4030700@redhat.com>
In-Reply-To: <4D6A6E38.4030700@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Avi Kivity <avi@redhat.com>
Cc: Jes.Sorensen@redhat.com, Marcelo Tosatti <mtosatti@redhat.com>, qemu-devel@nongnu.org

On 02/27/2011 09:31 AM, Avi Kivity wrote:
> On 02/27/2011 04:00 PM, Anthony Liguori wrote:
>> On 02/27/2011 03:10 AM, Avi Kivity wrote:
>>> On 02/24/2011 07:58 PM, Anthony Liguori wrote:
>>>>> If you move the cdrom to a different IDE channel, you have to 
>>>>> update the stateful non-config file.
>>>>>
>>>>> Whereas if you do
>>>>>
>>>>>    $ qemu-img create -f cd-tray -b ~/foo.img ~/foo-media-tray.img
>>>>>    $ qemu -cdrom ~/foo-media-tray.img
>>>>>
>>>>> the cd-rom tray state will be tracked in the image file.
>>>>
>>>>
>>>> Yeah, but how do you move it? 
>>>
>>> There is no need to move the file at all.  Simply point the new 
>>> drive at the media tray.
>>
>> No, I was asking, how do you move the cdrom to a different IDE 
>> channel.  Are you using QMP?  Are you changing the command line 
>> arguments?
>
> Yes.
>
> If we're doing hot-move (not really relevant to ide-cd) then you'd use 
> QMP.  If you're editing a virtual machine that is down, or scheduling 
> a change for the next reboot, then you're using command line arguments 
> (or cold-plugging into a stopped guest).
>
> Requiring management to remember the old configuration and issue delta 
> commands to move the device for the cold-plug case is increased 
> complexity IMO.
>
>>
>>>
>>>> If you do a remove/add through QMP, then the config file will 
>>>> reflect things just fine.
>>>
>>> If all access to the state file is through QMP then it becomes more 
>>> palatable.  A bit on that later.
>>
>> As I think I've mentioned before, I hadn't really thought about an 
>> opaque state file but I'm not necessarily opposed to it.  I don't see 
>> an obvious advantage to making it opaque but I agree it should be 
>> accessible via QMP.
>
> The advantage is that we keep the management tool talking to one 
> interface (I don't think we should prevent users from interpreting it, 
> just make it unnecessary).
>
>>>
>>> I thought that's what I'm doing by separating the state out.  It's 
>>> easy for management to assemble configuration from their database 
>>> and convert it into a centralized representation (like a qemu 
>>> command line).  It's a lot harder to disassemble a central state 
>>> representation and move it back to the database.
>>>
>>> Using QMP is better than directly accessing the state file since 
>>> qemu does the disassembly for you (provided the command references 
>>> the device using its normal path, not some random key).  The file 
>>> just becomes a way to survive a crash, and all management needs to 
>>> know about is to make it available and back it up.  But it means 
>>> that everything must be done via QMP, including assembly of the 
>>> machine, otherwise the state file can become stale.
>>>
>>> Separating the state out to the device is even easier, since 
>>> management is already expected to take care of disk images.  All 
>>> that's needed is to create the media tray image once, then you can 
>>> forget about it completely.
>>
>> Except that instead of having one state file, we might have a dozen 
>> additional "device state" files.
>
> That is fine.  We already have one state file per block device.
>
>>>> QEMU.   No question about it.  At any point in time, we are the 
>>>> authoritative source of what the guest's configuration is.  There's 
>>>> no doubt about it.  A management tool can try to keep up with us, 
>>>> but ultimately we are the only ones that know for sure.
>>>>
>>>> We have all of this information internally.  Just persisting it is 
>>>> not a major architectural change.  It's something we should have 
>>>> been doing (arguably) from the very beginning.
>>>
>>> That's a huge divergence from how management tools are written.
>>
>> This is one of the reasons why management tooling around QEMU needs 
>> quite a bit of improving.
>>
>> There is simply no way a management tool can do a good job of being 
>> an authoritative source of configuration.  The races we're discussing 
>> are a good example of why.
>
> What we're discussing is not configuration.  It is non-volatile 
> state.  Configuration comes from the user; state comes from the guest 
> (the management tool may edit state; but the guest cannot edit the 
> configuration).
>
> I agree 100% the management tool cannot be the authoritative source of 
> state.
>
> My position is:
> - the management tool should be 100% in control of configuration (how 
> the guest is put together from its components)
> - qemu should be 100% in control of state (memory, disk state, NVRAM 
> in various components, cd-rom eject state, explosive bolts for payload 
> separation, self-destruct mechanism, etc.)

There simply is not such a clean separation between the two because 
things that the guest does affect the configuration of the guest.

Hot plug, removable media eject, persistent device settings (whether 
it's CMOS or EEPROM) all disrupt this model.

If you really wanted to have this separation, you'd have to be very 
strict about making all guest settings not be specified in config.  You 
would need to do:

qemu-img create -f e1000-eprom -o macaddr=12:23:45:67:78:90 e1000.0.rom
qemu-img create -f e1000-eprom -o macaddr=12:23:45:67:78:91 e1000.1.rom

qemu -device e1000,id=e1000.0,eeprom=e1000.0.rom \
     -device e1000,id=e1000.1,eeprom=e1000.1.rom

And now I need a tool that lets me modify e1000-eprom images if I want 
to change the MAC address dynamically (say I'm trying to clone a VM).
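
To give a rough idea of what such a tool involves, here is a minimal 
sketch.  It assumes the hypothetical e1000-eprom image is nothing more 
than the raw 64-word EEPROM contents, little-endian, with the MAC 
address in words 0-2 and the checksum in word 0x3F chosen so that all 
64 words sum to 0xBABA (as on real e1000 hardware).  The function name 
and the image layout are my assumptions, not an existing format.

```python
import struct

EEPROM_WORDS = 64         # assumed image size: 64 16-bit words
CHECKSUM_WORD = 0x3F      # last word holds the checksum
CHECKSUM_TARGET = 0xBABA  # all words must sum to this, modulo 2**16


def set_mac(image: bytes, mac: str) -> bytes:
    """Return a copy of a raw EEPROM image with a new MAC address."""
    words = list(struct.unpack("<%dH" % EEPROM_WORDS, image))

    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected six octets in MAC address")

    # Words 0-2 hold the six MAC octets in transmission order.
    words[0:3] = struct.unpack("<3H", octets)

    # Recompute the checksum word so the total comes out right again.
    partial = sum(words[:CHECKSUM_WORD]) & 0xFFFF
    words[CHECKSUM_WORD] = (CHECKSUM_TARGET - partial) & 0xFFFF

    return struct.pack("<%dH" % EEPROM_WORDS, *words)
```

Cloning a VM would then mean reading e1000.0.rom, calling set_mac() 
with the new address, and writing the result out as the clone's ROM.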

This type of model can be workable but as I said earlier, I think it's 
overengineering the problem.

We don't separate configuration from guest state today.  Instead of 
setting ourselves up for failure by setting an unrealistic standard that 
we try to achieve and never do, let's embrace the system that is working 
for us today.  We are authoritative for everything and guest state is 
intimately tied to the virtual machine configuration.

>>
>> But beyond those races, QEMU is the only entity that knows with 
>> certainty what bits of information are important to persist in order 
>> to preserve a guest across shutdown/restart.  The fact that we've 
>> punted this problem for so long has only ensured that management 
>> tools are either intrinsically broken or only support the most 
>> minimal subset of functionality we actually support.
>
> I'm not arguing about that.  I just want to stress again the 
> difference between state and configuration.  Qemu has no authority, in 
> my mind, as to configuration.  Only state.

Being the one that creates a guest based on configuration, I would say 
that we most certainly do.

>>>   Currently they contain the required guest configuration, a 
>>> representation of what's the current live configuration, and they 
>>> issue monitor commands to move the live configuration towards the 
>>> required configuration (or just generate a qemu command line).  What 
>>> you're describing is completely different, I'm not even sure what it 
>>> is.
>>
>> Management tools shouldn't have to think about how the monitor 
>> commands they issue impact the invocation options of QEMU.
>
> They have to, when creating a guest from scratch.
>
> But I admit, this throws a new light (for me) on things.  What's the 
> implications?
> - must have a qemu instance running when editing configuration, even 
> when the guest is down

QMP is an API.  Whether a qemu instance is launched is an implementation 
detail.  This could all be hidden completely with libqmp.

> - cannot add additional information to configuration; must store it in 
> an external database and cross-reference it with the qemu data using 
> the device ID

Don't confuse a management tool's notion of configuration with QEMU's 
configuration.

A management tool's config is used to initially create and then 
manipulate an existing guest.   If the management tool supports 
out-of-band manipulation of a configuration file, then it needs to 
determine how the configuration file changed and execute the appropriate 
commands.
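
As a rough sketch of what "determine how the configuration file 
changed" could look like: the function below compares two snapshots of 
a tool's device list and produces the QMP commands needed to get from 
one to the other.  The function name and the dictionary layout are 
made up for illustration; only the device_add and device_del command 
names come from QMP.  It ignores ordering constraints and does not 
wait for a removal to complete before re-adding a device.

```python
def delta_commands(old, new):
    """Return QMP command dicts that turn device set `old` into `new`.

    `old` and `new` map a device id to its properties, for example
    {"net0": {"driver": "e1000", "mac": "52:54:00:12:34:56"}}.
    """
    cmds = []

    # Remove devices that are gone or whose properties changed.
    for dev_id in sorted(old):
        if dev_id not in new or old[dev_id] != new[dev_id]:
            cmds.append({"execute": "device_del",
                         "arguments": {"id": dev_id}})

    # Add devices that are new or whose properties changed.
    for dev_id in sorted(new):
        if dev_id not in old or old[dev_id] != new[dev_id]:
            args = {"id": dev_id}
            args.update(new[dev_id])
            cmds.append({"execute": "device_add", "arguments": args})

    return cmds
```

A device whose properties changed is simply removed and added again, 
which is the crude remove/add approach discussed earlier in the thread.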

> - when editing non-hotpluggable configuration for the next boot, must 
> maintain old config somewhere, so we can issue delta commands later 
> (might be needed for current way of doing things)

Yes, it is.  libvirt kind of cheats here and just deletes the old VM and 
creates a new one when editing the XML IIUC.

> - no transactions/queries/etc except on non-authoritative source
> - issues with shared-nothing design (well, can store the configuration 
> file using DRBD).

In both cases, today a management tool races with QEMU so both of these 
points are currently true.

>>> If you look at management tools, they believe they are the 
>>> authoritative source of configuration information (not guest state, 
>>> which is more or less ignored).
>>
>> It's because we've given them no other option.
>
> It's the natural way of doing it.  You have a web interface that talks 
> to a database.  When you want to list all VMs that have network cards 
> on the production subnet, you issue a database query and get a 
> recordset.  How do you do that when the authoritative source of 
> information is spread across a cluster?

This problem still exists today.  A guest can eject a network card on 
its own (without the management tool issuing a device_del command).  
QEMU will delete the NIC when this happens.  The same is true with CDROM 
eject.

Management tools are simply not authoritative today.

Regards,

Anthony Liguori