From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Date: Thu, 20 Jan 2011 10:15:20 +0100
Message-ID: <4D37FD28.8000402@redhat.com>
References: <1295415904-11918-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <1295415904-11918-10-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <4D36B130.4010608@redhat.com> <4D36EC41.5050104@redhat.com>
To: Yoshiaki Tamura
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
 kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
 qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, blauwirbel@gmail.com,
 ohmura.kei@lab.ntt.co.jp, avi@redhat.com, psuriset@linux.vnet.ibm.com,
 stefanha@linux.vnet.ibm.com

On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>> +                    blk_req->reqs[0].opaque);
>>>>
>>>> Same here.
>>>>
>>>>> +    bdrv_flush(bs);
>>>>
>>>> This looks really strange. What is this supposed to do?
>>>>
>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>> get an fsync for which you don't know if it includes the current write
>>>> request or if it doesn't. Which data do you want to get flushed to the
>>>> disk?
>>>
>>> I was expecting to flush the aio request that was just initiated.
>>> Am I misunderstanding the function?
>>
>> Seems so. The function names don't use really clear terminology either,
>> so you're not the first one to fall into this trap. Basically we have:
>>
>> * qemu_aio_flush() waits for all AIO requests to complete.
>> I think you wanted to have exactly this, but only for a single block
>> device. Such a function doesn't exist yet.
>>
>> * bdrv_flush() makes sure that all successfully completed requests are
>> written to disk (by calling fsync)
>>
>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it
>> runs the fsync in the thread pool
>
> Then what I wanted to do is: call qemu_aio_flush first, then
> bdrv_flush. It should be like live migration.

Okay, that makes sense. :-)

>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>> basically forcing everyone into something very similar to writethrough
>>>> mode. I'm sure this will have a big impact on performance.
>>>
>>> The reason is to avoid inversion of queued requests. Although
>>> processing one-by-one is heavy, wouldn't having requests flushed
>>> to disk out of order break the disk image?
>>
>> No, that's fine. If a guest issues two requests at the same time, they
>> may complete in any order. You just need to make sure that you don't
>> call the completion callback before the request really has completed.
>
> We need to flush requests, meaning aio and fsync, before sending
> the final state of the guests, to make sure we can switch to the
> secondary safely.

In theory I think you could just resubmit the requests on the secondary
if they hadn't completed yet. But you're right, let's keep things simple
for the start.

>> I'm just starting to wonder if the guest won't time out the requests if
>> they are queued for too long. Even more, with IDE, it can only handle
>> one request at a time, so not completing requests doesn't sound like a
>> good idea at all. At what intervals is the event-tap queue flushed?
>
> The requests are flushed once each transaction completes. So
> it's not at specific intervals.

Right. So when is a transaction completed? This is the time that a
single request will take.
>> On the other hand, if you complete before actually writing out, you
>> don't get timeouts, but you signal success to the guest when the request
>> could still fail. What would you do in this case? With a writeback cache
>> mode we're fine, we can just fail the next flush (until then nothing is
>> guaranteed to be on disk and order doesn't matter either), but with
>> cache=writethrough we're in serious trouble.
>>
>> Have you thought about this problem? Maybe we end up having to flush the
>> event-tap queue for each single write in writethrough mode.
>
> Yes, and that's what I'm trying to do at this point.

Oh, I must have missed that code. Which patch/function should I look at?

> I know that
> performance matters a lot, but sacrificing reliability for
> performance now isn't a good idea. I first want to lay the
> groundwork, and then focus on optimization. Note that without dirty
> bitmap optimization, Kemari suffers a lot in sending RAM.
> Anthony and I discussed taking this approach at KVM Forum.

I agree, starting simple makes sense.

Kevin