From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55852)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <zhang.zhanghailiang@huawei.com>) id 1aBZpY-0006cJ-Ot
	for qemu-devel@nongnu.org; Tue, 22 Dec 2015 22:11:14 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <zhang.zhanghailiang@huawei.com>) id 1aBZpX-0000hs-Bi
	for qemu-devel@nongnu.org; Tue, 22 Dec 2015 22:11:12 -0500
References: <1450167779-9960-1-git-send-email-zhang.zhanghailiang@huawei.com>
	<1450167779-9960-26-git-send-email-zhang.zhanghailiang@huawei.com>
	<8737uypxi9.fsf@blackfin.pond.sub.org>
From: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Message-ID: <567A10A0.5070408@huawei.com>
Date: Wed, 23 Dec 2015 11:10:24 +0800
MIME-Version: 1.0
In-Reply-To: <8737uypxi9.fsf@blackfin.pond.sub.org>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v12 25/38] qmp event: Add event
 notification for COLO error
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Markus Armbruster <armbru@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>, lizhijian@cn.fujitsu.com, quintela@redhat.com, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, stefanha@redhat.com, amit.shah@redhat.com, qemu-block@nongnu.org, dgilbert@redhat.com, hongyang.yang@easystack.cn

On 2015/12/19 18:02, Markus Armbruster wrote:
> Copying qemu-block because this seems related to generalising block jobs
> to background jobs.
>

Er, this event is only meant to tell users what happened to a VM running with
COLO FT enabled. When users receive it, they can check what went wrong and
decide which side should take over the work.

> zhanghailiang <zhang.zhanghailiang@huawei.com> writes:
>
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <armbru@redhat.com>
>> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> ---
>>   docs/qmp-events.txt | 17 +++++++++++++++++
>>   migration/colo.c    | 11 +++++++++++
>>   qapi-schema.json    | 16 ++++++++++++++++
>>   qapi/event.json     | 17 +++++++++++++++++
>>   4 files changed, 61 insertions(+)
>>
>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>> index d2f1ce4..19f68fc 100644
>> --- a/docs/qmp-events.txt
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>   Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>   event.
>>
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
>
> How would the event's recipient distinguish between "due to error" and
> "at the user's request"?
>

If the event's 'reason' is 'request', the exit was at the user's request.
Otherwise 'reason' is 'error' and the exit was due to an error; in that case the
optional 'error' field carries a message which may help to figure out what happened.
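
For illustration only, a management tool could tell the two cases apart roughly
like this. It is a minimal sketch and not part of the patch: the client_* names
are made up, it assumes the event's "data" members have already been parsed into
C strings, and treating every non-request exit as "needs attention" is just one
possible policy.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical client-side view of the exit reasons. 'unknown' is left
 * out on the assumption that it gets dropped from the schema. */
typedef enum {
    CLIENT_COLO_EXIT_REQUEST,
    CLIENT_COLO_EXIT_ERROR,
    CLIENT_COLO_EXIT_INVALID   /* 'reason' string not recognised */
} ClientColoExitReason;

/* Map the "reason" string from the event's data to the enum above. */
static ClientColoExitReason client_parse_reason(const char *reason)
{
    if (reason && strcmp(reason, "request") == 0) {
        return CLIENT_COLO_EXIT_REQUEST;
    }
    if (reason && strcmp(reason, "error") == 0) {
        return CLIENT_COLO_EXIT_ERROR;
    }
    return CLIENT_COLO_EXIT_INVALID;
}

/* Write a one-line summary into buf. 'error' may be NULL because the
 * field is optional. Returns true when the exit was not requested by
 * the user, i.e. when someone should look at what went wrong. */
static bool client_handle_colo_exit(const char *mode, const char *reason,
                                    const char *error,
                                    char *buf, size_t len)
{
    switch (client_parse_reason(reason)) {
    case CLIENT_COLO_EXIT_REQUEST:
        snprintf(buf, len, "%s: COLO exited at the user's request", mode);
        return false;
    case CLIENT_COLO_EXIT_ERROR:
        snprintf(buf, len, "%s: COLO exited due to error: %s", mode,
                 error ? error : "(no message)");
        return true;
    default:
        snprintf(buf, len, "%s: COLO exited, unrecognised reason", mode);
        return true;
    }
}
```

With the example event from the patch (mode "primary", reason "request", no
error), client_handle_colo_exit() would return false and fill the buffer with
the "at the user's request" summary.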

>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, optional)
>> +
>> +Example:
>> +
>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> +
>
> Pardon my ignorance again...  Does "VM finishes COLO mode" mean we have
> some kind of COLO background job, and it just finished for whatever
> reason?
>

As I explained above, this event only tells users what happened to the VM.

> If yes, this COLO job could be an instance of the general background job
> concept we're trying to grow from the existing block job concept.
>
> I'm not asking you to rebase your work onto the background job
> infrastructure, not least for the simple reason that it doesn't exist,
> yet.  But I think it would be fruitful to compare your COLO job
> management QMP interface with the one we have for block jobs.  Not only
> may that avoid unnecessary inconsistency, it could also help shape the
> general background job interface.
>

Interesting, I'm not very familiar with the block job infrastructure.
If we treat COLO FT as a background job, we can certainly use it. I will have a look
at it.

> Quick overview of the block job QMP interface:
>
> * Commands to create a job: block-commit, block-stream, drive-mirror,
>    drive-backup.
>
> * Get information on jobs: query-block-jobs
>
> * Pause a job: block-job-pause
>
> * Resume a job: block-job-resume
>
> * Cancel a job: block-job-cancel
>
> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
>
> * Block job error event: BLOCK_JOB_ERROR
>
> * Block job synchronous completion: event BLOCK_JOB_READY and command
>    block-job-complete
>
>>   DEVICE_DELETED
>>   --------------
>>
>> diff --git a/migration/colo.c b/migration/colo.c
>> index d1dd4e1..d06c14f 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>   #include "qemu/error-report.h"
>>   #include "qemu/sockets.h"
>>   #include "migration/failover.h"
>> +#include "qapi-event.h"
>>
>>   /* colo buffer */
>>   #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>   out:
>>       if (ret < 0) {
>>           error_report("%s: %s", __func__, strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>       }
>>
>>       qsb_free(buffer);
>> @@ -516,6 +522,11 @@ out:
>>       if (ret < 0) {
>>           error_report("colo incoming thread will exit, detect error: %s",
>>                        strerror(-ret));
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
>> +                                  true, strerror(-ret), NULL);
>> +    } else {
>> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_REQUEST,
>> +                                  false, NULL, NULL);
>>       }
>>
>>       if (fb) {
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index feb7d53..f6ecb88 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -778,6 +778,22 @@
>>     'data': [ 'unknown', 'primary', 'secondary'] }
>>
>>   ##
>> +# @COLOExitReason
>> +#
>> +# The reason for a COLO exit
>> +#
>> +# @unknown: unknown reason
>
> How can @unknown happen?
>

>> +#
>> +# @request: COLO exit is due to an external request
>> +#
>> +# @error: COLO exit is due to an internal error
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum': 'COLOExitReason',
>> +  'data': [ 'unknown', 'request', 'error'] }
>> +
>> +##
>>   # @x-colo-lost-heartbeat
>>   #
>>   # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>> diff --git a/qapi/event.json b/qapi/event.json
>> index f0cef01..f63d456 100644
>> --- a/qapi/event.json
>> +++ b/qapi/event.json
>> @@ -255,6 +255,23 @@
>>     'data': {'status': 'MigrationStatus'}}
>>
>>   ##
>> +# @COLO_EXIT
>> +#
>> +# Emitted when VM finishes COLO mode due to some errors happening or
>> +# at the request of users.
>> +#
>> +# @mode: which COLO mode the VM was in when it exited.
>
> Can we get 'unknown' here?
>

No, I will remove it :)

>> +#
>> +# @reason: describes the reason for the COLO exit.
>
> Can we get 'unknown' here?
>

No, it should never happen for now. I will remove it.

>> +#
>> +# @error: #optional, error message. Only present on error happening.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'event': 'COLO_EXIT',
>> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason', '*error': 'str' } }
>> +
>> +##
>>   # @ACPI_DEVICE_OST
>>   #
>>   # Emitted when guest executes ACPI _OST method.
>
> .
>