From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57653)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <zhang.zhanghailiang@huawei.com>) id 1aBZtz-00013l-Bu
	for qemu-devel@nongnu.org; Tue, 22 Dec 2015 22:15:48 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <zhang.zhanghailiang@huawei.com>) id 1aBZtw-0002ho-0k
	for qemu-devel@nongnu.org; Tue, 22 Dec 2015 22:15:47 -0500
References: <1450167779-9960-1-git-send-email-zhang.zhanghailiang@huawei.com>
	<1450167779-9960-26-git-send-email-zhang.zhanghailiang@huawei.com>
	<8737uypxi9.fsf@blackfin.pond.sub.org> <56786BA0.70400@redhat.com>
From: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Message-ID: <567A1179.2040509@huawei.com>
Date: Wed, 23 Dec 2015 11:14:01 +0800
MIME-Version: 1.0
In-Reply-To: <56786BA0.70400@redhat.com>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH COLO-Frame v12 25/38] qmp
 event: Add event notification for COLO error
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: John Snow <jsnow@redhat.com>, Markus Armbruster <armbru@redhat.com>
Cc: qemu-block@nongnu.org, lizhijian@cn.fujitsu.com, quintela@redhat.com, qemu-devel@nongnu.org, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, Michael Roth <mdroth@linux.vnet.ibm.com>, arei.gonglei@huawei.com, stefanha@redhat.com, amit.shah@redhat.com, dgilbert@redhat.com, hongyang.yang@easystack.cn

On 2015/12/22 5:14, John Snow wrote:
>
>
> On 12/19/2015 05:02 AM, Markus Armbruster wrote:
>> Copying qemu-block because this seems related to generalising block jobs
>> to background jobs.
>>
>> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>>
>>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>>> failover work immediately.
>>> If users don't want to get involved in COLO's failover verdict,
>>> it is still necessary to notify users that we exited COLO mode.
>>>
>>> Cc: Markus Armbruster <armbru@redhat.com>
>>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> ---
>>> v11:
>>> - Fix several typos found by Eric
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> ---
>>>   docs/qmp-events.txt | 17 +++++++++++++++++
>>>   migration/colo.c    | 11 +++++++++++
>>>   qapi-schema.json    | 16 ++++++++++++++++
>>>   qapi/event.json     | 17 +++++++++++++++++
>>>   4 files changed, 61 insertions(+)
>>>
>>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>>> index d2f1ce4..19f68fc 100644
>>> --- a/docs/qmp-events.txt
>>> +++ b/docs/qmp-events.txt
>>> @@ -184,6 +184,23 @@ Example:
>>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>>   event.
>>>
>>> +COLO_EXIT
>>> +---------
>>> +
>>> +Emitted when VM finishes COLO mode due to some errors happening or
>>> +at the request of users.
>>
>> How would the event's recipient distinguish between "due to error" and
>> "at the user's request"?
>>
>>> +
>>> +Data:
>>> +
>>> + - "mode": COLO mode, primary or secondary side (json-string)
>>> + - "reason":  the exit reason, internal error or external request. (json-string)
>>> + - "error": error message (json-string, operation)
>>> +
>>> +Example:
>>> +
>>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>>> +
>>
>> Pardon my ignorance again...  Does "VM finishes COLO mode" means have
>> some kind of COLO background job, and it just finished for whatever
>> reason?
>>
>> If yes, this COLO job could be an instance of the general background job
>> concept we're trying to grow from the existing block job concept.
>>
>> I'm not asking you to rebase your work onto the background job
>> infrastructure, not least for the simple reason that it doesn't exist,
>> yet.  But I think it would be fruitful to compare your COLO job
>> management QMP interface with the one we have for block jobs.  Not only
>> may that avoid unnecessary inconsistency, it could also help shape the
>> general background job interface.
>>
>
> Yes. The "background job" concept doesn't exist in a formal way outside
> of the block layer yet, but we're looking to expand it as we re-tool the
> block jobs themselves.
>
> It may be the case that the COLO commands and events need to go in as
> they are now, but later we can bring them back into the generalized job
> infrastructure.
>

Agreed. ;)

>> Quick overview of the block job QMP interface:
>>
>> * Commands to create a job: block-commit, block-stream, drive-mirror,
>>    drive-backup.
>>
>> * Get information on jobs: query-block-jobs
>>
>> * Pause a job: block-job-pause
>>
>> * Resume a job: block-job-resume
>>
>> * Cancel a job: block-job-cancel
>>
>> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
>>
>> * Block job error event: BLOCK_JOB_ERROR
>>
>> * Block job synchronous completion: event BLOCK_JOB_READY and command
>>    block-job-complete
>>
>
> The block-agnostic version of these commands would likely be:
>
> query-jobs
> job-pause
> job-resume
> job-cancel
> job-complete
>
> Events: JOB_COMPLETED, JOB_CANCELLED, JOB_ERROR, JOB_READY.
>
>
> It looks like COLO_EXIT would be an instance of JOB_COMPLETED, and if it
> occurred due to an error, we'd also see JOB_ERROR emitted.
>

Yes, if we use this job frame for COLO, the COLO_EXIT will be like that.

>>>   DEVICE_DELETED
>>>   --------------
>>>
>>> diff --git a/migration/colo.c b/migration/colo.c
>>> index d1dd4e1..d06c14f 100644
>>> --- a/migration/colo.c
>>> +++ b/migration/colo.c
>>> @@ -18,6 +18,7 @@
>>>   #include "qemu/error-report.h"
>>>   #include "qemu/sockets.h"
>>>   #include "migration/failover.h"
>>> +#include "qapi-event.h"
>>>
>>>   /* colo buffer */
>>>   #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>>   out:
>>>       if (ret < 0) {
>>>           error_report("%s: %s", __func__, strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>       }
>>>
>>>       qsb_free(buffer);
>>> @@ -516,6 +522,11 @@ out:
>>>       if (ret < 0) {
>>>           error_report("colo incoming thread will exit, detect error: %s",
>>>                        strerror(-ret));
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>>> +                                  true, strerror(-ret), NULL);
>>> +    } else {
>>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>>> +                                  false, NULL, NULL);
>>>       }
>>>
>>>       if (fb) {
>>> diff --git a/qapi-schema.json b/qapi-schema.json
>>> index feb7d53..f6ecb88 100644
>>> --- a/qapi-schema.json
>>> +++ b/qapi-schema.json
>>> @@ -778,6 +778,22 @@
>>>     'data': [ 'unknown', 'primary', 'secondary'] }
>>>
>>>   ##
>>> +# @COLOExitReason
>>> +#
>>> +# The reason for a COLO exit
>>> +#
>>> +# @unknown: unknown reason
>>
>> How can @unknown happen?
>>
>>> +#
>>> +# @request: COLO exit is due to an external request
>>> +#
>>> +# @error: COLO exit is due to an internal error
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'enum': 'COLOExitReason',
>>> +  'data': [ 'unknown', 'request', 'error'] }
>>> +
>>> +##
>>>   # @x-colo-lost-heartbeat
>>>   #
>>>   # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>>> diff --git a/qapi/event.json b/qapi/event.json
>>> index f0cef01..f63d456 100644
>>> --- a/qapi/event.json
>>> +++ b/qapi/event.json
>>> @@ -255,6 +255,23 @@
>>>     'data': {'status': 'MigrationStatus'}}
>>>
>>>   ##
>>> +# @COLO_EXIT
>>> +#
>>> +# Emitted when VM finishes COLO mode due to some errors happening or
>>> +# at the request of users.
>>> +#
>>> +# @mode: which COLO mode the VM was in when it exited.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @reason: describes the reason for the COLO exit.
>>
>> Can we get 'unknown' here?
>>
>>> +#
>>> +# @error: #optional, error message. Only present on error happening.
>>> +#
>>> +# Since: 2.6
>>> +##
>>> +{ 'event': 'COLO_EXIT',
>>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>>> +
>>> +##
>>>   # @ACPI_DEVICE_OST
>>>   #
>>>   # Emitted when guest executes ACPI _OST method.
>>
>