Message-ID: <4D382094.1030807@redhat.com>
Date: Thu, 20 Jan 2011 12:46:28 +0100
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
References: <1295415904-11918-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <1295415904-11918-10-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <4D36B130.4010608@redhat.com> <4D36EC41.5050104@redhat.com>
 <4D37FD28.8000402@redhat.com>
In-Reply-To:
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
 kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
 qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, blauwirbel@gmail.com,
 ohmura.kei@lab.ntt.co.jp, avi@redhat.com, psuriset@linux.vnet.ibm.com,
 stefanha@linux.vnet.ibm.com

On 20.01.2011 11:39, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf :
>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>
>>>>>> Same here.
>>>>>>
>>>>>>> +    bdrv_flush(bs);
>>>>>>
>>>>>> This looks really strange. What is this supposed to do?
>>>>>>
>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>
>>>>> I was expecting to flush the aio request that was just initiated.
>>>>> Am I misunderstanding the function?
>>>>
>>>> Seems so. The function names don't use really clear terminology either,
>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>
>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>> wanted to have exactly this, but only for a single block device. Such a
>>>> function doesn't exist yet.
>>>>
>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>> written to disk (by calling fsync)
>>>>
>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs
>>>> the fsync in the thread pool
>>>
>>> Then what I wanted to do is call qemu_aio_flush first, then
>>> bdrv_flush. It should be like live migration.
>>
>> Okay, that makes sense. :-)
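
To make the intended ordering explicit, this is roughly what I would
expect per block device before the final state is sent (just a sketch;
the helper name is made up, and qemu_aio_flush() is global because a
per-device wait doesn't exist yet):

    /* Sketch only: drain in-flight AIO first, then fsync what has
     * completed, before the final VM state goes to the secondary. */
    static void flush_blockdev_for_failover(BlockDriverState *bs)
    {
        /* Wait for all outstanding AIO requests to complete (global,
         * not per device). */
        qemu_aio_flush();

        /* Everything in flight has completed now, so this fsync is
         * known to cover the requests we care about. */
        bdrv_flush(bs);
    }

Doing it in this order avoids the problem above, where bdrv_flush() runs
right after bdrv_aio_writev() and you can't tell whether the fsync
covers that request.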

>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>> basically forcing everyone to something very similar to writethrough
>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>
>>>>> The reason is to avoid inversion of queued requests. Although
>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>> to disk out of order break the disk image?
>>>>
>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>> may complete in any order. You just need to make sure that you don't
>>>> call the completion callback before the request really has completed.
>>>
>>> We need to flush requests, meaning aio and fsync, before sending
>>> the final state of the guests, to make sure we can switch to the
>>> secondary safely.
>>
>> In theory I think you could just re-submit the requests on the secondary
>> if they had not completed yet.
>>
>> But you're right, let's keep things simple for the start.
>>
>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>> they are queued for too long. Even more, with IDE, it can only handle
>>>> one request at a time, so not completing requests doesn't sound like a
>>>> good idea at all. In what intervals is the event-tap queue flushed?
>>>
>>> The requests are flushed once each transaction completes. So
>>> it's not with specific intervals.
>>
>> Right. So when is a transaction completed? This is the time that a
>> single request will take.
>
> The transaction is completed when the vm state is sent to the
> secondary, and the primary receives the ack to it. Please let me
> know if the answer is too vague. What I can tell is that it
> can't be super fast.
>
>>>> On the other hand, if you complete before actually writing out, you
>>>> don't get timeouts, but you signal success to the guest when the request
>>>> could still fail. What would you do in this case? With a writeback cache
>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>> cache=writethrough we're in serious trouble.
>>>>
>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>> event-tap queue for each single write in writethrough mode.
>>>
>>> Yes, and that's what I'm trying to do at this point.
>>
>> Oh, I must have missed that code. Which patch/function should I look at?
>
> Maybe I mis-answered your question. The device may receive
> timeouts.

We have to make sure that the guest doesn't see timeouts. I'm not
expecting that I/O will be super fast, and as long as it is only a
performance problem we can live with it.

However, as soon as the guest gets timeouts, it reports I/O errors and
eventually offlines the block device. At that point it's no longer just
a performance problem, but a correctness problem. This is why I
suggested that we flush the event-tap queue (i.e. complete the
transaction) immediately after an I/O request has been issued, instead
of waiting for other events that would complete the transaction.

> If timeouts didn't happen, the requests are flushed
> one-by-one in writethrough because we're calling qemu_aio_flush
> and bdrv_flush together.

I think this is what we must do.

Kevin
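
PS: To spell out what "flush the event-tap queue immediately" would mean
for cache=writethrough, roughly the following (pseudo-code only; the
event_tap_* names are made up and are not the interface of this series,
and the writethrough check is just shorthand for "the device is in
cache=writethrough mode"):

    /* Pseudo-code sketch, not patch code. */
    static BlockDriverAIOCB *event_tap_aio_writev(BlockDriverState *bs,
                                                  int64_t sector_num,
                                                  QEMUIOVector *qiov,
                                                  int nb_sectors,
                                                  BlockDriverCompletionFunc *cb,
                                                  void *opaque)
    {
        /* Record the write so it can be replayed on the secondary. */
        BlockDriverAIOCB *acb =
            event_tap_queue_request(bs, sector_num, qiov, nb_sectors, cb, opaque);

        if (!bdrv_enable_write_cache(bs)) {
            /* cache=writethrough: complete the transaction right away
             * (send the VM state, wait for the ack, then really submit
             * the request and bdrv_flush()), so the guest never sees a
             * request stuck in the queue long enough to time out. */
            event_tap_complete_transaction();
        }

        return acb;
    }

With a writeback cache we can keep batching, because nothing is
guaranteed to be on disk until the next flush anyway.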