From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1NeZ43-0005C4-TC
	for qemu-devel@nongnu.org; Mon, 08 Feb 2010 14:14:31 -0500
Received: from [199.232.76.173] (port=37807 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1NeZ43-0005Bt-Es
	for qemu-devel@nongnu.org; Mon, 08 Feb 2010 14:14:31 -0500
Received: from Debian-exim by monty-python.gnu.org with spam-scanned (Exim
	4.60) (envelope-from <anthony@codemonkey.ws>) id 1NeZ40-0000vw-Ro
	for qemu-devel@nongnu.org; Mon, 08 Feb 2010 14:14:31 -0500
Received: from mail-iw0-f185.google.com ([209.85.223.185]:34145)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <anthony@codemonkey.ws>) id 1NeZ40-0000vU-E4
	for qemu-devel@nongnu.org; Mon, 08 Feb 2010 14:14:28 -0500
Received: by iwn15 with SMTP id 15so4150772iwn.19
	for <qemu-devel@nongnu.org>; Mon, 08 Feb 2010 11:14:27 -0800 (PST)
Message-ID: <4B706290.7020104@codemonkey.ws>
Date: Mon, 08 Feb 2010 13:14:24 -0600
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: Two QMP events issues
References: <20100208114145.4bd64349@doriath>	<20100208141218.GG17328@redhat.com>	<4B702470.5080401@codemonkey.ws>	<20100208145653.GA25256@redhat.com>	<4B702A21.1070808@codemonkey.ws>
	<20100208162521.788f9c02@doriath>
In-Reply-To: <20100208162521.788f9c02@doriath>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Luiz Capitulino <lcapitulino@redhat.com>
Cc: qemu-devel@nongnu.org, armbru@redhat.com

On 02/08/2010 12:25 PM, Luiz Capitulino wrote:
> On Mon, 08 Feb 2010 09:13:37 -0600
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>
>    
>> On 02/08/2010 08:56 AM, Daniel P. Berrange wrote:
>>      
>>> On Mon, Feb 08, 2010 at 08:49:20AM -0600, Anthony Liguori wrote:
>>>
>>>        
>>>> On 02/08/2010 08:12 AM, Daniel P. Berrange wrote:
>>>>
>>>>          
>>>>> For further background, the key end goal here is that in a QMP client, upon
>>>>> receipt of the 'RESET' event, we need to reliably & immediately determine
>>>>> why it occurred, e.g. triggered by watchdog, or by guest OS request. There
>>>>> are actually 3 possible sequences:
>>>>>
>>>>>    - WATCHDOG + action=reset, followed by RESET.  Assuming no intervening
>>>>>      event can occur, the client can merely record 'WATCHDOG' and interpret
>>>>>      it when it gets the immediately following 'RESET' event
>>>>>
>>>>>    - RESET, followed by WATCHDOG + action=reset. The client doesn't know
>>>>>      the reason for the RESET and can't wait arbitrarily for WATCHDOG since
>>>>>      there might never be one arriving.
>>>>>
>>>>>    - RESET + source=watchdog. Client directly sees the reason
>>>>>
>>>>> The second scenario is the one I'd like us to avoid at all costs, since it
>>>>> will require the client to introduce arbitrary delays in processing events
>>>>> to determine cause. The first is slightly inconvenient, but doable if we
>>>>> can assume no intervening events will occur, between WATCHDOG and the
>>>>> RESET events. The last is obviously simplest for the clients.
>>>>>
>>>>>
>>>>>            
>>>> I really prefer the third option but I'm a little concerned that we're
>>>> throwing events around somewhat haphazardly.
>>>>
>>>> So let me ask, why does a client need to determine when a guest reset
>>>> and why it reset?
>>>>
>>>>          
>>> If a guest OS is repeatedly hanging/crashing resulting in the watchdog
>>> device firing, management software for the host really wants to know about
>>> that (so that appropriate alerts/action can be taken) and thus needs to
>>> be able to distinguish this from a "normal"  guest OS initiated reboot.
>>>
>>>        
>> I think that's an argument for having the watchdog events independent of
>> the reset events.
>>
>> The watchdog condition happening is not directly related to the action
>> the watchdog takes.  The watchdog event really belongs in a class of events
>> that are closely associated with a particular device emulation.
>>
>> In fact, I think what we're really missing in events today is a notion
>> of a context.  A RESET event is really a CPU event.  A watchdog
>> expiration event is a watchdog event.  A connect event is a VNC event
>> (Spice and chardevs will also generate connect events).
>>      
>   This could be done by adding a 'context' member to all the events and
> then an event would have to be identified by the pair event_name:context.
>
>   This way we can have the same event_name for events in different
> contexts. For example:
>
> { 'event': 'DISCONNECT', 'context': 'spice', [...] }
>
> { 'event': 'DISCONNECT', 'context': 'vnc', [...] }
>
>   Note that today we have VNC_DISCONNECT and will probably have
> SPICE_DISCONNECT too.
>    

Which is why we gave ourselves until 0.13 to straighten out the protocol.

N.B. in this model, you'd have:

{ 'event' : 'EXPIRED', 'context': 'watchdog', 'action': 'reset' }
/* some arbitrary number of events */
{ 'event' : 'RESET', 'context': 'cpu' }

And the only reason RESET follows EXPIRED is because action=reset.  If 
action was different, a RESET might not occur.

A client needs to see the EXPIRED event, determine whether to expect a 
RESET event, and if so, wait for the next RESET event to happen.
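
As a rough sketch of that client-side logic (not real client code): the
Python below assumes events arrive as already-decoded dicts carrying the
proposed 'event' and 'context' members, which QMP does not emit today.
The ResetTracker name and the "other" label for resets with no pending
watchdog expiry are made up for illustration.

```python
class ResetTracker:
    """Attribute each RESET to an earlier watchdog EXPIRED event, if any."""

    def __init__(self):
        # True once EXPIRED with action=reset was seen and the matching
        # RESET has not arrived yet.
        self.watchdog_reset_pending = False

    def handle(self, event):
        """Return the cause for a RESET event, or None for any other event."""
        # Events are identified by the (event name, context) pair.
        key = (event.get("event"), event.get("context"))
        if key == ("EXPIRED", "watchdog"):
            # Only action=reset means a RESET is expected to follow.
            if event.get("action") == "reset":
                self.watchdog_reset_pending = True
            return None
        if key == ("RESET", "cpu"):
            cause = "watchdog" if self.watchdog_reset_pending else "other"
            self.watchdog_reset_pending = False
            return cause
        # Any number of unrelated events may arrive in between.
        return None


# Hypothetical event stream: watchdog expiry, an unrelated event,
# the resulting reset, then a second reset with no watchdog involved.
stream = [
    {"event": "EXPIRED", "context": "watchdog", "action": "reset"},
    {"event": "DISCONNECT", "context": "vnc"},
    {"event": "RESET", "context": "cpu"},
    {"event": "RESET", "context": "cpu"},
]
tracker = ResetTracker()
causes = [tracker.handle(e) for e in stream]
```

Note that the client never has to delay processing: the cause is known
at the moment RESET arrives, because EXPIRED always comes first.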

Regards,

Anthony Liguori