From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:59714) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f0jFI-0001Hp-DP for qemu-devel@nongnu.org; Tue, 27 Mar 2018 03:42:18 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f0jFE-0005uk-GJ for qemu-devel@nongnu.org; Tue, 27 Mar 2018 03:42:16 -0400 Received: from mail-it0-x230.google.com ([2607:f8b0:4001:c0b::230]:56017) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1f0jFE-0005uJ-A9 for qemu-devel@nongnu.org; Tue, 27 Mar 2018 03:42:12 -0400 Received: by mail-it0-x230.google.com with SMTP id 142-v6so754325itl.5 for ; Tue, 27 Mar 2018 00:42:12 -0700 (PDT) References: <20180313075739.11194-1-xiaoguangrong@tencent.com> <20180313075739.11194-4-xiaoguangrong@tencent.com> <20180321100043.GA30634@xz-mi> <48da6d2b-f5f4-a077-f282-ed52392c17f9@gmail.com> <20180327072204.GJ17789@xz-mi> From: Xiao Guangrong Message-ID: <51b59c4c-b153-e711-d78a-8611780b2874@gmail.com> Date: Tue, 27 Mar 2018 03:42:32 +0800 MIME-Version: 1.0 In-Reply-To: <20180327072204.GJ17789@xz-mi> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Peter Xu Cc: pbonzini@redhat.com, mst@redhat.com, mtosatti@redhat.com, Xiao Guangrong , qemu-devel@nongnu.org, kvm@vger.kernel.org On 03/27/2018 03:22 PM, Peter Xu wrote: > On Thu, Mar 22, 2018 at 08:03:53PM +0800, Xiao Guangrong wrote: >> >> >> On 03/21/2018 06:00 PM, Peter Xu wrote: >>> On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.xiao@gmail.com wrote: >>>> From: Xiao Guangrong >>>> >>>> Currently the page being compressed is allowed to be updated by >>>> the VM on the source QEMU, correspondingly the destination QEMU >>>> just ignores the decompression error. However, we completely miss >>>> the chance to catch real errors, then the VM is corrupted silently >>>> >>>> To make the migration more robuster, we copy the page to a buffer >>>> first to avoid it being written by VM, then detect and handle the >>>> errors of both compression and decompression errors properly >>> >>> Not sure I missed anything important, but I'll just shoot my thoughts >>> as questions (again)... >>> >>> Actually this is a more general question? Say, even without >>> compression, we can be sending a page that is being modified. >>> >>> However, IMHO we don't need to worry that, since if that page is >>> modified, we'll definitely send that page again, so the new page will >>> replace the old. So on destination side, even if decompress() failed >>> on a page it'll be fine IMHO. Though now we are copying the corrupted >>> buffer. On that point, I fully agree that we should not - maybe we >>> can just drop the page entirely? >>> >>> For non-compress pages, we can't detect that, so we'll copy the page >>> even if corrupted. >>> >>> The special part for compression would be: would the deflate() fail if >>> there is concurrent update to the buffer being compressed? And would >>> that corrupt the whole compression stream, or it would only fail the >>> deflate() call? >> >> It is not the same for normal page and compressed page. >> >> For the normal page, the dirty-log mechanism in QEMU and the infrastructure >> of the network (e.g, TCP) can make sure that the modified memory will >> be posted to the destination without corruption. >> >> However, nothing can guarantee compression/decompression is BUG-free, >> e,g, consider the case, in the last step, vCPUs & dirty-log are paused and >> the memory is compressed and posted to destination, if there is any error >> in compression/decompression, VM dies silently. > > Here do you mean the compression error even if the VM is halted? I'd > say in that case IMHO the extra memcpy() would still help little since > the coiped page should exactly be the same as the source page? ”compression error“ means that compress2() in original code returns a error code. If the data being compressed is being modified at the some time, compression will fail and this failure is negative. We move the data to a internal buffer to avoid this case, so that we can catch the real error condition. > > I'd say I don't know what we can really do if there are zlib bugs. I > was assuming we'll definitely fail in a strange way if there is any, > which should be hard to be detected from QEMU's POV (maybe a > destination VM crash, as you mentioned). It'll be easy for us to > detect errors when we got error code returned from compress(), however > IMHO when we say "zlib bug" it can also mean that data is corrputed > even compress() and decompress() both returned with good state. > Ah, sorry, i abused the word "BUG". It does not mean the BUG in compression/decompression API, i mean the failure conditions (The API returns a error code). > It'll be understandable to me if the problem is that the compress() > API does not allow the input buffer to be changed during the whole > period of the call. If that is a must, this patch for sure helps. Yes, that is exactly what i want to say. :)