From: Kevin Wolf <kwolf@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
Date: Mon, 20 Sep 2010 17:08:28 +0200
Message-ID: <4C9778EC.9060704@redhat.com>
In-Reply-To: <4C977028.3050602@codemonkey.ws>
On 20.09.2010 16:31, Anthony Liguori wrote:
> On 09/20/2010 08:56 AM, Kevin Wolf wrote:
>> I won't get this ready until I leave for vacation on Wednesday, so I thought I
>> could just as well post it as an RFC in this state.
>>
>> With this patch applied, qcow2 doesn't directly access the image file any more
>> for metadata, but rather goes through the newly introduced blkqueue. Write
>> and sync requests are queued there and executed in a separate worker thread.
>> Reads consider the contents of the queue before accessing the image file.
>>
>> What makes this interesting is that we can delay syncs and if multiple syncs
>> occur, we can merge them into one bdrv_flush.
>>
>> A typical sequence in qcow2 (simple cluster allocation) looks like this:
>>
>> 1. Update refcount table
>> 2. bdrv_flush
>> 3. Update L2 entry
>>
>
> Let's expand it a bit more:
>
> 1. Update refcount table
> 2. bdrv_flush
> 3. Update L2 entry
> 4. Write data to disk
> 5. Report write complete
>
> I'm struggling to understand how a thread helps out.
This sequence becomes:

1. Update refcount table
2. Write data to disk
3. Report write complete

And only later:

4. Update L2 entry
5. bdrv_flush (possibly merged with other flushes)
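To illustrate how flushes from several requests could be merged, here is a
minimal Python sketch. It is a toy model only, not the blkqueue code from the
patch: the class ToyBlockQueue, its methods, and the string keys standing in
for image offsets are all made up for this example. Each request gets a
context that counts the barriers it has issued, and writes are sorted into the
section matching that count, so two requests that each ask for one flush share
a single flush when the queue is drained.

```python
class ToyBlockQueue:
    """Toy model of a delayed metadata queue (hypothetical, not QEMU's API)."""

    def __init__(self, image):
        self.image = image      # dict standing in for the image file
        self.sections = [[]]    # sections[i] holds the writes queued after i barriers
        self.flush_count = 0    # number of simulated bdrv_flush calls

    def new_context(self):
        # One context per request; it must not be reused after drain().
        return {"section": 0}

    def write(self, ctx, key, value):
        self.sections[ctx["section"]].append((key, value))

    def barrier(self, ctx):
        # Later writes of this context must hit the disk after a flush.
        ctx["section"] += 1
        if ctx["section"] == len(self.sections):
            self.sections.append([])

    def read(self, key):
        # Queued writes take precedence over the image; the newest one wins.
        for section in reversed(self.sections):
            for queued_key, value in reversed(section):
                if queued_key == key:
                    return value
        return self.image.get(key)

    def drain(self):
        # Write section by section, with one flush at each section boundary.
        for index, section in enumerate(self.sections):
            if index > 0:
                self.flush_count += 1
            for key, value in section:
                self.image[key] = value
        self.sections = [[]]


image = {}
queue = ToyBlockQueue(image)
a, b = queue.new_context(), queue.new_context()

# Two cluster allocations, each: refcount update, barrier, L2 update.
queue.write(a, "refcount[5]", 1)
queue.barrier(a)
queue.write(a, "l2[0]", 5)
queue.write(b, "refcount[6]", 1)
queue.barrier(b)
queue.write(b, "l2[1]", 6)

queued = queue.read("l2[1]")   # served from the queue, the image is still empty
queue.drain()                  # both refcounts, one flush, then both L2 entries
```

In this model the two allocations cost one flush between the refcount updates
and the L2 updates instead of one flush each. The final flush that makes the
L2 entries stable is left out to keep the sketch short.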
> If you run 1-3 in a thread, you need to inject a barrier between steps 3
> and 5 or you'll report the write complete before writing the metadata
> out. You can't delay completing step 3 until a guest requests a flush.
> If you do, then you're implementing a writeback cache for metadata.
Yeah, if you like to call it that, that's probably an accurate description.
> If you're comfortable with a writeback cache for metadata, then you
> should also be comfortable with a writeback cache for data in which
> case, cache=writeback is the answer.
Well, there is a difference: We don't pollute the host page cache with
guest data and we don't get a virtual "disk cache" as big as the host
RAM, but only a very limited queue of metadata.
Basically, in qemu we have three different types of caching:

1. O_DSYNC, everything is always synced without any explicit request.
   This is cache=writethrough.
2. Nothing is ever synced. This is cache=unsafe.
3. We present a writeback disk cache to the guest and the guest needs
   to explicitly flush to get its data safely on disk. This is
   cache=writeback and cache=none.
So they are actually very similar; the only difference is whether to use
O_DIRECT or not. In principle, regarding the integrity requirements
there is already no difference between cache=none and cache=writeback today.
We're still lacking modes for O_DSYNC | O_DIRECT and unsafe | O_DIRECT,
but they are entirely possible, because these are two different dimensions.
(And I think Christoph was planning to actually make them two independent
options.)
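The two dimensions can be written down as a small table. The following Python
sketch is only an illustration of the modes as described in this mail: the
CacheMode class and the policy names "always", "on_guest_flush" and "never"
are invented for the example and are not qemu identifiers.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CacheMode:
    sync_policy: str   # "always" (O_DSYNC), "on_guest_flush", or "never"
    o_direct: bool     # True if the host page cache is bypassed


# The four cache= settings discussed above, placed on the two axes.
EXISTING = {
    "writethrough": CacheMode("always", False),
    "unsafe": CacheMode("never", False),
    "writeback": CacheMode("on_guest_flush", False),
    "none": CacheMode("on_guest_flush", True),
}

# Every combination of the two axes, minus the ones that already have a name.
all_combinations = {
    CacheMode(policy, direct)
    for policy, direct in product(("always", "on_guest_flush", "never"),
                                  (False, True))
}
missing = all_combinations - set(EXISTING.values())
```

Under these assumptions, missing contains exactly the two combinations
mentioned above, O_DSYNC with O_DIRECT and unsafe with O_DIRECT, and
cache=none and cache=writeback share the same sync policy.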
You have a point in that we need to disable the queueing for
cache=writethrough. I'm aware of that, but forgot to mention it in the
todo list.
> If it's a matter of batching, batching can't occur if you have a barrier
> between steps 3 and 5. The only way you can get batching is by doing a
> writeback cache for the metadata such that you can complete your request
> before the metadata is written.
>
> Am I misunderstanding the idea?
No, I think you understand it right, but maybe you were not completely
aware that cache=none doesn't mean writethrough.
Kevin