qemu-devel.nongnu.org archive mirror
From: Kevin Wolf <kwolf@redhat.com>
To: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
	kvm@vger.kernel.org, mst@redhat.com, mtosatti@redhat.com,
	qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com,
	blauwirbel@gmail.com, ohmura.kei@lab.ntt.co.jp, avi@redhat.com,
	psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Date: Thu, 20 Jan 2011 15:21:23 +0100	[thread overview]
Message-ID: <4D3844E3.2010104@redhat.com> (raw)
In-Reply-To: <AANLkTinrK9vwSsjjO9N_vhb6pxG6o3nvRaskxG1t4xav@mail.gmail.com>

On 20.01.2011 14:50, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf <kwolf@redhat.com>:
>> On 20.01.2011 11:39, Yoshiaki Tamura wrote:
>>> 2011/1/20 Kevin Wolf <kwolf@redhat.com>:
>>>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>>
>>>>>>>> Same here.
>>>>>>>>
>>>>>>>>> +    bdrv_flush(bs);
>>>>>>>>
>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>
>>>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>>>
>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>> Am I misunderstanding the function?
>>>>>>
>>>>>> Seems so. The function names don't use really clear terminology either,
>>>>>> so you're not the first one to fall in this trap. Basically we have:
>>>>>>
>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>> function doesn't exist yet.
>>>>>>
>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>> written to disk (by calling fsync)
>>>>>>
>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>>>>> the fsync in the thread pool
>>>>>
>>>>> Then what I want to do is call qemu_aio_flush() first, then
>>>>> bdrv_flush(), just as live migration does.
>>>>
>>>> Okay, that makes sense. :-)
>>>>
>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>> basically forcing everyone to something very similar to writethrough
>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>
>>>>>>> The reason is to avoid inversion of queued requests.  Although
>>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>>> to disk out of order break the disk image?
>>>>>>
>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>> call the completion callback before the request really has completed.
>>>>>
>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>> the final state of the guests, to make sure we can switch to the
>>>>> secondary safely.
>>>>
>>>> In theory I think you could just re-submit the requests on the secondary
>>>> if they had not completed yet.
>>>>
>>>> But you're right, let's keep things simple for the start.
>>>>
>>>>>> I'm just starting to wonder if the guest won't timeout the requests if
>>>>>> they are queued for too long. Even more, with IDE, it can only handle
>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>> good idea at all. In what intervals is the event-tap queue flushed?
>>>>>
>>>>> The requests are flushed once each transaction completes, so
>>>>> it's not at specific intervals.
>>>>
>>>> Right. So when is a transaction completed? This is the time that a
>>>> single request will take.
>>>
>>> The transaction is completed when the VM state has been sent to the
>>> secondary and the primary has received the ack for it.  Please let me
>>> know if the answer is too vague.  What I can tell is that it
>>> can't be super fast.
>>>
>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>> cache=writethrough we're in serious trouble.
>>>>>>
>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>> event-tap queue for each single write in writethrough mode.
>>>>>
>>>>> Yes, and that's what I'm trying to do at this point.
>>>>
>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>
>>> Maybe I mis-answered your question.  The device may see
>>> timeouts.
>>
>> We should pay attention that the guest does not see timeouts. I'm not
>> expecting that I/O will be super fast, and as long as it is only a
>> performance problem we can live with it.
>>
>> However, as soon as the guest gets timeouts it reports I/O errors and
>> eventually offlines the block device. At this point it's not a
>> performance problem any more, but also a correctness problem.
>>
>> This is why I suggested that we flush the event-tap queue (i.e. complete
>> the transaction) immediately after an I/O request has been issued
>> instead of waiting for other events that would complete the transaction.
> 
> Right.  event-tap doesn't queue at specific intervals.  It
> schedules the transaction as a bh as soon as events are tapped.  The
> purpose of the queue is to store requests initiated while the
> transaction is in progress.

Ok, now I got it. :-)

So the patches are already doing the best we can do.

> So I believe the current implementation should be doing
> what you're expecting.  However, if the guest dirties a huge amount
> of RAM and then initiates block requests, we may get timeouts even if we
> start the transaction right away.

Right. We'll have to live with that for now. If it happens, bad luck.

Kevin


Thread overview: 44+ messages
2011-01-19  5:44 [Qemu-devel] [PATCH 00/19] Kemari for KVM v0.2.6 Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 01/19] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 02/19] Introduce read() to FdMigrationState Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 03/19] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 04/19] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 05/19] vl.c: add deleted flag for deleting the handler Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 06/19] virtio: decrement last_avail_idx with inuse before saving Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 07/19] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 08/19] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 09/19] Introduce event-tap Yoshiaki Tamura
2011-01-19  9:38   ` Kevin Wolf
2011-01-19 13:04     ` Yoshiaki Tamura
2011-01-19 13:50       ` Kevin Wolf
2011-01-20  5:19         ` Yoshiaki Tamura
2011-01-20  9:15           ` Kevin Wolf
2011-01-20 10:39             ` Yoshiaki Tamura
2011-01-20 11:46               ` Kevin Wolf
2011-01-20 13:50                 ` Yoshiaki Tamura
2011-01-20 14:21                   ` Kevin Wolf [this message]
2011-01-20 15:48                     ` Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 10/19] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 11/19] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 12/19] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 13/19] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async() Yoshiaki Tamura
2011-01-19  5:44 ` [Qemu-devel] [PATCH 14/19] block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush() Yoshiaki Tamura
2011-01-19  9:05   ` Kevin Wolf
2011-01-19 12:06     ` Yoshiaki Tamura
2011-01-19  9:47   ` Kevin Wolf
2011-01-19 13:16     ` Yoshiaki Tamura
2011-01-19 14:08       ` Kevin Wolf
2011-01-20  5:01         ` Yoshiaki Tamura
2011-01-19  5:45 ` [Qemu-devel] [PATCH 15/19] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
2011-01-19  5:45 ` [Qemu-devel] [PATCH 16/19] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
2011-01-19  5:45 ` [Qemu-devel] [PATCH 17/19] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
2011-01-19  5:45 ` [Qemu-devel] [PATCH 18/19] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
2011-01-19  5:45 ` [Qemu-devel] [PATCH 19/19] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
