Message-ID: <4D382094.1030807@redhat.com>
Date: Thu, 20 Jan 2011 12:46:28 +0100
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
References: <1295415904-11918-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <1295415904-11918-10-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <4D36B130.4010608@redhat.com> <4D36EC41.5050104@redhat.com>
 <4D37FD28.8000402@redhat.com>
In-Reply-To:
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
 kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
 qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, blauwirbel@gmail.com,
 ohmura.kei@lab.ntt.co.jp, avi@redhat.com, psuriset@linux.vnet.ibm.com,
 stefanha@linux.vnet.ibm.com

On 20.01.2011 11:39, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf :
>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>
>>>>>> Same here.
>>>>>>
>>>>>>> +    bdrv_flush(bs);
>>>>>>
>>>>>> This looks really strange. What is this supposed to do?
>>>>>>
>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>
>>>>> I was expecting to flush the aio request that was just initiated.
>>>>> Am I misunderstanding the function?
>>>>
>>>> Seems so. The function names don't use really clear terminology either,
>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>
>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>> wanted to have exactly this, but only for a single block device. Such a
>>>> function doesn't exist yet.
>>>>
>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>> written to disk (by calling fsync)
>>>>
>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs
>>>> the fsync in the thread pool
>>>
>>> Then what I wanted to do is call qemu_aio_flush first, then
>>> bdrv_flush. It should be like live migration.
>>
>> Okay, that makes sense. :-)
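
To make the intended ordering explicit, this is roughly what I would
expect per block device before the final state is sent (just a sketch;
the helper name is made up, and qemu_aio_flush() is global because a
per-device wait doesn't exist yet):

    /* Sketch only: drain in-flight AIO first, then fsync what has
     * completed, before the final VM state goes to the secondary. */
    static void flush_blockdev_for_failover(BlockDriverState *bs)
    {
        /* Wait for all outstanding AIO requests to complete (global,
         * not per device). */
        qemu_aio_flush();

        /* Everything in flight has completed now, so this fsync is
         * known to cover the requests we care about. */
        bdrv_flush(bs);
    }

Doing it in this order avoids the problem above, where bdrv_flush() runs
right after bdrv_aio_writev() and you can't tell whether the fsync
covers that request.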

>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>> basically forcing everyone to something very similar to writethrough
>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>
>>>>> The reason is to avoid inversion of queued requests. Although
>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>> to disk out of order break the disk image?
>>>>
>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>> may complete in any order. You just need to make sure that you don't
>>>> call the completion callback before the request really has completed.
>>>
>>> We need to flush requests, meaning aio and fsync, before sending
>>> the final state of the guests, to make sure we can switch to the
>>> secondary safely.
>>
>> In theory I think you could just re-submit the requests on the secondary
>> if they had not completed yet.
>>
>> But you're right, let's keep things simple for the start.
>>
>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>> they are queued for too long. Even more, with IDE, it can only handle
>>>> one request at a time, so not completing requests doesn't sound like a
>>>> good idea at all. In what intervals is the event-tap queue flushed?
>>>
>>> The requests are flushed once each transaction completes. So
>>> it's not with specific intervals.
>>
>> Right. So when is a transaction completed? This is the time that a
>> single request will take.
>
> The transaction is completed when the vm state is sent to the
> secondary, and the primary receives the ack to it. Please let me
> know if the answer is too vague. What I can tell is that it
> can't be super fast.
>
>>>> On the other hand, if you complete before actually writing out, you
>>>> don't get timeouts, but you signal success to the guest when the request
>>>> could still fail. What would you do in this case? With a writeback cache
>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>> cache=writethrough we're in serious trouble.
>>>>
>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>> event-tap queue for each single write in writethrough mode.
>>>
>>> Yes, and that's what I'm trying to do at this point.
>>
>> Oh, I must have missed that code. Which patch/function should I look at?
>
> Maybe I mis-answered your question. The device may receive
> timeouts.

We have to make sure that the guest doesn't see timeouts. I'm not
expecting that I/O will be super fast, and as long as it is only a
performance problem we can live with it.

However, as soon as the guest gets timeouts, it reports I/O errors and
eventually offlines the block device. At that point it's no longer just
a performance problem, but a correctness problem. This is why I
suggested that we flush the event-tap queue (i.e. complete the
transaction) immediately after an I/O request has been issued, instead
of waiting for other events that would complete the transaction.

> If timeouts didn't happen, the requests are flushed
> one-by-one in writethrough because we're calling qemu_aio_flush
> and bdrv_flush together.

I think this is what we must do.

Kevin
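
PS: To spell out what "flush the event-tap queue immediately" would mean
for cache=writethrough, roughly the following (pseudo-code only; the
event_tap_* names are made up and are not the interface of this series,
and the writethrough check is just shorthand for "the device is in
cache=writethrough mode"):

    /* Pseudo-code sketch, not patch code. */
    static BlockDriverAIOCB *event_tap_aio_writev(BlockDriverState *bs,
                                                  int64_t sector_num,
                                                  QEMUIOVector *qiov,
                                                  int nb_sectors,
                                                  BlockDriverCompletionFunc *cb,
                                                  void *opaque)
    {
        /* Record the write so it can be replayed on the secondary. */
        BlockDriverAIOCB *acb =
            event_tap_queue_request(bs, sector_num, qiov, nb_sectors, cb, opaque);

        if (!bdrv_enable_write_cache(bs)) {
            /* cache=writethrough: complete the transaction right away
             * (send the VM state, wait for the ack, then really submit
             * the request and bdrv_flush()), so the guest never sees a
             * request stuck in the queue long enough to time out. */
            event_tap_complete_transaction();
        }

        return acb;
    }

With a writeback cache we can keep batching, because nothing is
guaranteed to be on disk until the next flush anyway.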