From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xiao Guangrong <guangrong.xiao@gmail.com>
Subject: Re: [PATCH 3/8] migration: support to detect
 compression and decompression errors
Date: Tue, 27 Mar 2018 03:42:32 +0800
Message-ID: <51b59c4c-b153-e711-d78a-8611780b2874@gmail.com>
References: <20180313075739.11194-1-xiaoguangrong@tencent.com>
	<20180313075739.11194-4-xiaoguangrong@tencent.com>
	<20180321100043.GA30634@xz-mi>
	<48da6d2b-f5f4-a077-f282-ed52392c17f9@gmail.com>
	<20180327072204.GJ17789@xz-mi>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
	Xiao Guangrong <xiaoguangrong@tencent.com>,
	qemu-devel@nongnu.org, pbonzini@redhat.com
To: Peter Xu <peterx@redhat.com>
Return-path: <qemu-devel-bounces+gceq-qemu-devel2=m.gmane.org@nongnu.org>
In-Reply-To: <20180327072204.GJ17789@xz-mi>
Content-Language: en-US
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+gceq-qemu-devel2=m.gmane.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+gceq-qemu-devel2=m.gmane.org@nongnu.org>
List-Id: kvm.vger.kernel.org


On 03/27/2018 03:22 PM, Peter Xu wrote:
> On Thu, Mar 22, 2018 at 08:03:53PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 03/21/2018 06:00 PM, Peter Xu wrote:
>>> On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Currently the page being compressed is allowed to be updated by
>>>> the VM on the source QEMU, correspondingly the destination QEMU
>>>> just ignores the decompression error. However, we completely miss
>>>> the chance to catch real errors, then the VM is corrupted silently
>>>>
>>>> To make the migration more robuster, we copy the page to a buffer
>>>> first to avoid it being written by VM, then detect and handle the
>>>> errors of both compression and decompression errors properly
>>>
>>> Not sure I missed anything important, but I'll just shoot my thoughts
>>> as questions (again)...
>>>
>>> Actually this is a more general question? Say, even without
>>> compression, we can be sending a page that is being modified.
>>>
>>> However, IMHO we don't need to worry that, since if that page is
>>> modified, we'll definitely send that page again, so the new page will
>>> replace the old.  So on destination side, even if decompress() failed
>>> on a page it'll be fine IMHO.  Though now we are copying the corrupted
>>> buffer.  On that point, I fully agree that we should not - maybe we
>>> can just drop the page entirely?
>>>
>>> For non-compress pages, we can't detect that, so we'll copy the page
>>> even if corrupted.
>>>
>>> The special part for compression would be: would the deflate() fail if
>>> there is concurrent update to the buffer being compressed?  And would
>>> that corrupt the whole compression stream, or it would only fail the
>>> deflate() call?
>>
>> It is not the same for normal page and compressed page.
>>
>> For the normal page, the dirty-log mechanism in QEMU and the infrastructure
>> of the network (e.g, TCP) can make sure that the modified memory will
>> be posted to the destination without corruption.
>>
>> However, nothing can guarantee compression/decompression is BUG-free,
>> e,g, consider the case, in the last step, vCPUs & dirty-log are paused and
>> the memory is compressed and posted to destination, if there is any error
>> in compression/decompression, VM dies silently.
> 
> Here do you mean the compression error even if the VM is halted?  I'd
> say in that case IMHO the extra memcpy() would still help little since
> the coiped page should exactly be the same as the source page?

”compression error“ means that compress2() in original code returns a
error code.

If the data being compressed is being modified at the some time,
compression will fail and this failure is negative. We move the data to
a internal buffer to avoid this case, so that we can catch the real
error condition.

> 
> I'd say I don't know what we can really do if there are zlib bugs. I
> was assuming we'll definitely fail in a strange way if there is any,
> which should be hard to be detected from QEMU's POV (maybe a
> destination VM crash, as you mentioned).  It'll be easy for us to
> detect errors when we got error code returned from compress(), however
> IMHO when we say "zlib bug" it can also mean that data is corrputed
> even compress() and decompress() both returned with good state.
> 

Ah, sorry, i abused the word "BUG".

It does not mean the BUG in compression/decompression API, i mean the
failure conditions (The API returns a error code).

> It'll be understandable to me if the problem is that the compress()
> API does not allow the input buffer to be changed during the whole
> period of the call.  If that is a must, this patch for sure helps.

Yes, that is exactly what i want to say. :)