From: Kevin Wolf <kwolf@redhat.com>
To: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com,
blauwirbel@gmail.com, ohmura.kei@lab.ntt.co.jp, avi@redhat.com,
psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Date: Thu, 20 Jan 2011 15:21:23 +0100
Message-ID: <4D3844E3.2010104@redhat.com>
In-Reply-To: <AANLkTinrK9vwSsjjO9N_vhb6pxG6o3nvRaskxG1t4xav@mail.gmail.com>
On 20.01.2011 14:50, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf <kwolf@redhat.com>:
>> On 20.01.2011 11:39, Yoshiaki Tamura wrote:
>>> 2011/1/20 Kevin Wolf <kwolf@redhat.com>:
>>>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>>>> + return;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>> + blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>> + blk_req->reqs[0].opaque);
>>>>>>>>
>>>>>>>> Same here.
>>>>>>>>
>>>>>>>>> + bdrv_flush(bs);
>>>>>>>>
>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>
>>>>>>>> One point is that you write it immediately after bdrv_aio_writev, so you
>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>>>
>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>> Am I misunderstanding the function?
>>>>>>
>>>>>> Seems so. The function names don't use very clear terminology either,
>>>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>>>
>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>> function doesn't exist yet.
>>>>>>
>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>> written to disk (by calling fsync)
>>>>>>
>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>>>>> the fsync in the thread pool
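To make the distinction between these three operations concrete, here is a toy, single-threaded model in plain C. None of these names are QEMU APIs: `sim_aio_writev`, `sim_aio_drain` and `sim_flush` are hypothetical stand-ins for `bdrv_aio_writev`, a per-device `qemu_aio_flush` (which, as noted, does not exist yet), and `bdrv_flush`. The model assumes each write is in one of three states: in flight, completed, or durable.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REQS 16

/* Hypothetical request states for this model; not QEMU types. */
enum req_state { REQ_IN_FLIGHT, REQ_COMPLETED, REQ_DURABLE };

struct sim_dev {
    enum req_state state[MAX_REQS];
    int nreqs;
};

/* Stand-in for bdrv_aio_writev: submit and return at once.
 * The write is only in flight afterwards. */
static int sim_aio_writev(struct sim_dev *d)
{
    assert(d->nreqs < MAX_REQS);
    d->state[d->nreqs] = REQ_IN_FLIGHT;
    return d->nreqs++;
}

/* What a per-device qemu_aio_flush would do: wait until every
 * in-flight request on this device has completed. */
static void sim_aio_drain(struct sim_dev *d)
{
    for (int i = 0; i < d->nreqs; i++) {
        if (d->state[i] == REQ_IN_FLIGHT) {
            d->state[i] = REQ_COMPLETED;
        }
    }
}

/* Stand-in for bdrv_flush (fsync): only requests that have already
 * completed become durable; in-flight ones are untouched. */
static void sim_flush(struct sim_dev *d)
{
    for (int i = 0; i < d->nreqs; i++) {
        if (d->state[i] == REQ_COMPLETED) {
            d->state[i] = REQ_DURABLE;
        }
    }
}

static bool sim_is_durable(const struct sim_dev *d, int req)
{
    return d->state[req] == REQ_DURABLE;
}
```

In this model, calling `sim_flush` right after `sim_aio_writev` leaves the new write not durable, while draining first and then flushing makes it durable. In real QEMU the first case is a race (the write may or may not have completed before the fsync runs), which the model simplifies to the unlucky outcome.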
>>>>>
>>>>> Then what I want to do is call qemu_aio_flush first, then
>>>>> bdrv_flush, just like live migration does.
>>>>
>>>> Okay, that makes sense. :-)
>>>>
>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>> basically forcing everyone into something very similar to writethrough
>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>
>>>>>>> The reason is to avoid reordering of queued requests. Although
>>>>>>> processing them one by one is expensive, wouldn't having requests flushed
>>>>>>> to disk out of order break the disk image?
>>>>>>
>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>> call the completion callback before the request really has completed.
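A minimal sketch of that rule, assuming a hypothetical backend that finishes requests in arbitrary order: the completion callback for a request fires only when that request has actually finished, so the guest may observe completions out of submission order, but each exactly once and never early. `struct req`, `backend_complete` and `guest_cb` are illustrative names, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>

typedef void (*completion_fn)(void *opaque, int ret);

struct req {
    bool done;           /* set only when the backend really finished */
    completion_fn cb;    /* guest-visible completion callback */
    void *opaque;
};

/* Called by the (hypothetical) backend when r has really completed.
 * Only at this point is the guest told about it. */
static void backend_complete(struct req *r, int ret)
{
    assert(!r->done);    /* a request must never complete twice */
    r->done = true;
    r->cb(r->opaque, ret);
}

/* Example guest side: record the order in which completions arrive. */
struct completion_log {
    int order[8];
    int n;
};

struct guest_req {
    struct completion_log *log;
    int id;
};

static void guest_cb(void *opaque, int ret)
{
    struct guest_req *g = opaque;
    (void)ret;
    g->log->order[g->log->n++] = g->id;
}
```

If request 1 finishes before request 0, the guest sees completion 1 first; that is legal, because the guest issued both concurrently and cannot depend on their relative order.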
>>>>>
>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>> the final state of the guest, to make sure we can switch to the
>>>>> secondary safely.
>>>>
>>>> In theory I think you could just re-submit the requests on the secondary
>>>> if they had not completed yet.
>>>>
>>>> But you're right, let's keep things simple for the start.
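One way the re-submission idea could look, as a hedged sketch only: assume the primary ships, along with each checkpoint, a log of guest writes it has issued, flagged by whether their completion was already delivered to the guest, and that both hosts see the same disk image. On failover, the secondary re-submits the writes the guest never saw complete. `struct write_log`, `replay_incomplete` and `count_sectors` are hypothetical, and the data buffers are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 32

/* Hypothetical record of a guest write issued on the primary. */
struct pending_write {
    long sector;
    int nb_sectors;
    int completed;   /* nonzero if the guest already saw it complete */
};

struct write_log {
    struct pending_write w[LOG_CAP];
    size_t n;
};

typedef void (*submit_fn)(long sector, int nb_sectors, void *opaque);

/* On failover, re-submit every write the guest issued but never saw
 * complete. Writing the same data again is harmless if the primary's
 * copy did reach the disk. Returns the number of re-submitted writes. */
static size_t replay_incomplete(const struct write_log *log,
                                submit_fn submit, void *opaque)
{
    size_t resubmitted = 0;
    for (size_t i = 0; i < log->n; i++) {
        if (!log->w[i].completed) {
            submit(log->w[i].sector, log->w[i].nb_sectors, opaque);
            resubmitted++;
        }
    }
    return resubmitted;
}

/* Example submit callback: just count how many sectors were replayed. */
static void count_sectors(long sector, int nb_sectors, void *opaque)
{
    (void)sector;
    *(long *)opaque += nb_sectors;
}
```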
>>>>
>>>>>> I'm just starting to wonder whether the guest will time out the requests if
>>>>>> they are queued for too long. Moreover, IDE can only handle
>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>> good idea at all. At what intervals is the event-tap queue flushed?
>>>>>
>>>>> The requests are flushed once each transaction completes, so
>>>>> it doesn't happen at fixed intervals.
>>>>
>>>> Right. So when does a transaction complete? That is the time a
>>>> single request will take.
>>>
>>> A transaction completes when the VM state has been sent to the
>>> secondary and the primary has received the ack for it. Please let me
>>> know if this answer is too vague. What I can say is that it
>>> can't be super fast.
>>>
>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>> cache=writethrough we're in serious trouble.
>>>>>>
>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>> event-tap queue for each single write in writethrough mode.
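The writeback/writethrough distinction can be sketched as follows, assuming a hypothetical per-device structure (not QEMU's BlockDriverState): early completion is only allowed with a writeback cache, and a failure that surfaces after the guest was already told "success" is remembered and reported on the guest's next flush. Clearing the error once it has been reported is a design choice of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device state for this sketch. */
struct early_dev {
    bool writethrough;    /* cache=writethrough vs. writeback */
    int deferred_error;   /* first error seen after an early completion */
};

/* May a write be acknowledged before it reaches the disk?
 * Only with a writeback cache: nothing is guaranteed until the next
 * flush anyway. With writethrough, completion must mean "on disk". */
static bool may_complete_early(const struct early_dev *d)
{
    return !d->writethrough;
}

/* The real write finished after the guest was already told it
 * succeeded: remember the first failure instead of dropping it. */
static void late_write_result(struct early_dev *d, int ret)
{
    if (ret < 0 && d->deferred_error == 0) {
        d->deferred_error = ret;
    }
}

/* Guest-issued flush: fail it with any deferred error, then clear it.
 * A real implementation would also fsync here. */
static int guest_flush(struct early_dev *d)
{
    int ret = d->deferred_error;
    d->deferred_error = 0;
    return ret;
}
```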
>>>>>
>>>>> Yes, and that's what I'm trying to do at this point.
>>>>
>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>
>>> Maybe I answered your question incorrectly. The device may
>>> see timeouts.
>>
>> We should make sure that the guest does not see timeouts. I'm not
>> expecting that I/O will be super fast, and as long as it is only a
>> performance problem we can live with it.
>>
>> However, as soon as the guest gets timeouts, it reports I/O errors and
>> eventually offlines the block device. At this point it's no longer just a
>> performance problem but a correctness problem.
>>
>> This is why I suggested that we flush the event-tap queue (i.e. complete
>> the transaction) immediately after an I/O request has been issued
>> instead of waiting for other events that would complete the transaction.
>
> Right. event-tap doesn't queue at a fixed interval. It
> schedules the transaction as a bh as soon as events are tapped. The
> purpose of the queue is to store requests initiated while a
> transaction is in progress.
Ok, now I got it. :-)
So the patches are already doing the best we can do.
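The scheduling described above can be modelled roughly like this. It is a sketch under stated assumptions, not the event-tap code: a tapped request is queued and immediately kicks a transaction via a bottom half (here just a flag standing in for `qemu_bh_schedule()`); requests tapped while a transaction is waiting for the secondary's ack stay queued and get their own transaction afterwards. All struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 64

/* Hypothetical event-tap state. For brevity the queue never wraps:
 * head and tail only grow, so at most QUEUE_CAP requests in total. */
struct event_tap {
    int queue[QUEUE_CAP];    /* ids of tapped, not yet released requests */
    int head, tail;          /* pending requests are queue[head..tail) */
    int covered;             /* pending requests covered by the running transaction */
    bool bh_scheduled;       /* transaction scheduled but not yet started */
    bool in_transaction;     /* VM state sent, waiting for the ack */
    int released[QUEUE_CAP]; /* requests handed to the real device, in order */
    int nreleased;
};

static void maybe_schedule(struct event_tap *t)
{
    if (!t->bh_scheduled && !t->in_transaction && t->head != t->tail) {
        t->bh_scheduled = true;   /* stands in for qemu_bh_schedule() */
    }
}

/* A guest I/O request was tapped: hold it back and kick a transaction. */
static void tap_request(struct event_tap *t, int id)
{
    assert(t->tail < QUEUE_CAP);
    t->queue[t->tail++] = id;
    maybe_schedule(t);
}

/* Bottom half: start a transaction covering everything queued so far. */
static void transaction_bh(struct event_tap *t)
{
    t->bh_scheduled = false;
    t->in_transaction = true;
    t->covered = t->tail - t->head;
    /* ... send the VM state to the secondary here ... */
}

/* The secondary acked: release the covered requests to the real device,
 * then schedule another transaction for anything tapped meanwhile. */
static void transaction_ack(struct event_tap *t)
{
    for (int i = 0; i < t->covered; i++) {
        t->released[t->nreleased++] = t->queue[t->head++];
    }
    t->covered = 0;
    t->in_transaction = false;
    maybe_schedule(t);
}
```

In this model a request's latency is the wait for its bh plus one full transaction, and possibly the tail of a transaction that was already in flight when it was tapped.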
> So I believe the current implementation already does
> what you're expecting. However, if the guest has dirtied a huge amount
> of RAM and initiated block requests, we may get timeouts even if we
> start the transaction right away.
Right. We'll have to live with that for now. If it happens, bad luck.
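Because a tapped request cannot complete before its transaction does, its latency is bounded below by roughly the time to send the dirty state plus one round trip for the ack. A back-of-the-envelope helper, assuming a constant link bandwidth, no compression, and ignoring CPU and device-state costs; the inputs below are illustrative, not Kemari measurements, and the 30 s guest timeout is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Rough lower bound on how long one tapped request waits, in ms:
 * time to transfer the dirty VM state plus one round trip for the ack.
 * Assumes constant bandwidth; ignores CPU and device state costs. */
static double transaction_ms(double dirty_bytes, double link_bytes_per_s,
                             double rtt_ms)
{
    return dirty_bytes / link_bytes_per_s * 1000.0 + rtt_ms;
}

/* Would this request exceed the guest driver's command timeout? */
static bool guest_would_time_out(double dirty_bytes, double link_bytes_per_s,
                                 double rtt_ms, double guest_timeout_ms)
{
    return transaction_ms(dirty_bytes, link_bytes_per_s, rtt_ms)
           > guest_timeout_ms;
}
```

For example, 4 MiB of dirty RAM over a 1 Gbit/s link (125e6 bytes/s) with a 0.5 ms round trip gives about 34 ms, well within a 30 s timeout, whereas 4 GiB of dirty RAM on the same link needs about 34 s and would exceed it.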
Kevin