From mboxrd@z Thu Jan 1 00:00:00 1970 From: Xiao Guangrong Subject: Re: [PATCH 3/8] migration: support to detect compression and decompression errors Date: Tue, 27 Mar 2018 03:42:32 +0800 Message-ID: <51b59c4c-b153-e711-d78a-8611780b2874@gmail.com> References: <20180313075739.11194-1-xiaoguangrong@tencent.com> <20180313075739.11194-4-xiaoguangrong@tencent.com> <20180321100043.GA30634@xz-mi> <48da6d2b-f5f4-a077-f282-ed52392c17f9@gmail.com> <20180327072204.GJ17789@xz-mi> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com, Xiao Guangrong , qemu-devel@nongnu.org, pbonzini@redhat.com To: Peter Xu Return-path: In-Reply-To: <20180327072204.GJ17789@xz-mi> Content-Language: en-US List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+gceq-qemu-devel2=m.gmane.org@nongnu.org Sender: "Qemu-devel" List-Id: kvm.vger.kernel.org On 03/27/2018 03:22 PM, Peter Xu wrote: > On Thu, Mar 22, 2018 at 08:03:53PM +0800, Xiao Guangrong wrote: >> >> >> On 03/21/2018 06:00 PM, Peter Xu wrote: >>> On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.xiao@gmail.com wrote: >>>> From: Xiao Guangrong >>>> >>>> Currently the page being compressed is allowed to be updated by >>>> the VM on the source QEMU, correspondingly the destination QEMU >>>> just ignores the decompression error. However, we completely miss >>>> the chance to catch real errors, then the VM is corrupted silently >>>> >>>> To make the migration more robuster, we copy the page to a buffer >>>> first to avoid it being written by VM, then detect and handle the >>>> errors of both compression and decompression errors properly >>> >>> Not sure I missed anything important, but I'll just shoot my thoughts >>> as questions (again)... >>> >>> Actually this is a more general question? Say, even without >>> compression, we can be sending a page that is being modified. >>> >>> However, IMHO we don't need to worry that, since if that page is >>> modified, we'll definitely send that page again, so the new page will >>> replace the old. So on destination side, even if decompress() failed >>> on a page it'll be fine IMHO. Though now we are copying the corrupted >>> buffer. On that point, I fully agree that we should not - maybe we >>> can just drop the page entirely? >>> >>> For non-compress pages, we can't detect that, so we'll copy the page >>> even if corrupted. >>> >>> The special part for compression would be: would the deflate() fail if >>> there is concurrent update to the buffer being compressed? And would >>> that corrupt the whole compression stream, or it would only fail the >>> deflate() call? >> >> It is not the same for normal page and compressed page. >> >> For the normal page, the dirty-log mechanism in QEMU and the infrastructure >> of the network (e.g, TCP) can make sure that the modified memory will >> be posted to the destination without corruption. >> >> However, nothing can guarantee compression/decompression is BUG-free, >> e,g, consider the case, in the last step, vCPUs & dirty-log are paused and >> the memory is compressed and posted to destination, if there is any error >> in compression/decompression, VM dies silently. > > Here do you mean the compression error even if the VM is halted? I'd > say in that case IMHO the extra memcpy() would still help little since > the coiped page should exactly be the same as the source page? ”compression error“ means that compress2() in original code returns a error code. If the data being compressed is being modified at the some time, compression will fail and this failure is negative. We move the data to a internal buffer to avoid this case, so that we can catch the real error condition. > > I'd say I don't know what we can really do if there are zlib bugs. I > was assuming we'll definitely fail in a strange way if there is any, > which should be hard to be detected from QEMU's POV (maybe a > destination VM crash, as you mentioned). It'll be easy for us to > detect errors when we got error code returned from compress(), however > IMHO when we say "zlib bug" it can also mean that data is corrputed > even compress() and decompress() both returned with good state. > Ah, sorry, i abused the word "BUG". It does not mean the BUG in compression/decompression API, i mean the failure conditions (The API returns a error code). > It'll be understandable to me if the problem is that the compress() > API does not allow the input buffer to be changed during the whole > period of the call. If that is a must, this patch for sure helps. Yes, that is exactly what i want to say. :)