* [Qemu-devel] [RFC] Generic image streaming
@ 2011-09-23 15:57 Stefan Hajnoczi
2011-09-26 5:32 ` Zhi Yong Wu
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-23 15:57 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Zhi Yong Wu
Here is my generic image streaming branch, which aims to provide a way
to copy the contents of a backing file into an image file of a running
guest without requiring specific support in the various block drivers
(e.g. qcow2, qed, vmdk):
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
The tree does not provide full image streaming yet but I'd like to
discuss the approach taken in the code. Here are the main points:
The image streaming API is available through HMP and QMP commands. When
streaming is started on a block device a coroutine is created to do the
background I/O work. The coroutine can be cancelled.
While the coroutine copies data from the backing file into the image
file, the guest may be performing I/O to the image file. Guest reads do
not conflict with streaming but guest writes require special handling.
If the guest writes to a region of the image file that we are currently
copying, then there is the potential to clobber the guest write with old
data from the backing file.
Previously I solved this in a QED-specific way by taking advantage of
the serialization of allocating write requests. In order to do this
generically we need to track in-flight requests and have the ability to
queue I/O. Guest writes that affect an in-flight streaming copy
operation must wait for that operation to complete before being issued.
Streaming copy operations must skip overlapping regions of guest writes.
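
The two arbitration rules above can be sketched as a pair of overlap checks
over a list of in-flight requests. This is only an illustrative model, not
code from the branch: TrackedRequest, ranges_overlap(),
guest_write_must_wait() and streaming_must_skip() are hypothetical names, and
the sketch assumes non-empty byte ranges whose end does not overflow 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request descriptor; not a QEMU structure. */
typedef struct {
    uint64_t offset;    /* start of the region, in bytes */
    uint64_t len;       /* length of the region, in bytes (assumed > 0) */
    bool is_streaming;  /* true: background copy, false: guest write */
} TrackedRequest;

/* Half-open ranges [off, off + len) overlap if each starts before the
 * other ends. */
static bool ranges_overlap(uint64_t a_off, uint64_t a_len,
                           uint64_t b_off, uint64_t b_len)
{
    return a_off < b_off + b_len && b_off < a_off + a_len;
}

/* Rule 1: a guest write waits while it overlaps an in-flight streaming
 * copy, so the copy cannot clobber it with old backing file data. */
static bool guest_write_must_wait(const TrackedRequest *inflight, size_t n,
                                  uint64_t off, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (inflight[i].is_streaming &&
            ranges_overlap(inflight[i].offset, inflight[i].len, off, len)) {
            return true;
        }
    }
    return false;
}

/* Rule 2: a streaming copy skips a region that overlaps an in-flight
 * guest write, since the guest data supersedes the backing file. */
static bool streaming_must_skip(const TrackedRequest *inflight, size_t n,
                                uint64_t off, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (!inflight[i].is_streaming &&
            ranges_overlap(inflight[i].offset, inflight[i].len, off, len)) {
            return true;
        }
    }
    return false;
}
```

A linear scan is used here for brevity; a real implementation would likely
want a structure that scales better with many in-flight requests.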
One big difference to the QED image streaming implementation is that
this generic implementation is not based on copy-on-read operations.
Instead we do a sequence of bdrv_is_allocated() to find regions for
streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
populate the image file.
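
As a rough illustration of that sequence, here is a toy in-memory model of the
streaming loop. It is a sketch under stated assumptions, not the code in the
branch: ToyImage, toy_is_allocated(), toy_read(), toy_write() and toy_stream()
are hypothetical stand-ins for the block layer calls named above, and the
cluster geometry is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { CLUSTER_SIZE = 4, NUM_CLUSTERS = 4 };  /* made-up geometry */

/* Toy image file: an allocation map plus data. Not a QEMU structure. */
typedef struct {
    bool allocated[NUM_CLUSTERS];
    unsigned char data[NUM_CLUSTERS * CLUSTER_SIZE];
} ToyImage;

/* Stand-in for the allocation query: is this cluster in the image file? */
static bool toy_is_allocated(const ToyImage *img, size_t cluster)
{
    return img->allocated[cluster];
}

/* Stand-in for the read step: fetch one cluster from the backing file. */
static void toy_read(const unsigned char *backing, size_t cluster,
                     unsigned char *buf)
{
    memcpy(buf, backing + cluster * CLUSTER_SIZE, CLUSTER_SIZE);
}

/* Stand-in for the write step: populate one cluster of the image file. */
static void toy_write(ToyImage *img, size_t cluster, const unsigned char *buf)
{
    memcpy(img->data + cluster * CLUSTER_SIZE, buf, CLUSTER_SIZE);
    img->allocated[cluster] = true;
}

/* Copy every unallocated cluster from the backing file into the image.
 * Returns the number of clusters copied. */
static size_t toy_stream(ToyImage *img, const unsigned char *backing)
{
    unsigned char buf[CLUSTER_SIZE];
    size_t copied = 0;

    for (size_t c = 0; c < NUM_CLUSTERS; c++) {
        if (toy_is_allocated(img, c)) {
            continue;  /* image already has its own data here */
        }
        toy_read(backing, c, buf);
        toy_write(img, c, buf);
        copied++;
    }
    return copied;
}
```

The model ignores concurrency entirely; arbitration against guest writes is
the part that the real implementation still has to add.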
It turns out that generic copy-on-read is not an attractive operation
because it requires using bounce buffers for every request. Kevin
pointed out the case where a guest performs a read and pokes the data
buffer before the read completes, copy-on-read would write out the
modified memory into the image file unless we use a bounce buffer.
There are a few pieces missing in my tree, which have mostly been solved
in other places and just need to be reused:
1. Arbitration between guest and streaming requests (this is the only
real new thing).
2. Efficient zero handling (skip writing those regions or mark them as
zero clusters).
3. Queuing/dependencies when arbitration decides a request must wait.
I'm taking a look at reusing Zhi Yong's block queue.
4. Rate-limiting to ensure streaming I/O does not impact the guest.
Already exists in the QED-specific patches, it may make sense to
extract common code that both migration and the block layer can use.
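
For item 4, one possible shape of a shared rate limiter is a per-slice byte
budget. The sketch below is an assumption made for illustration, not the code
from the QED patches: ToyRateLimit and toy_ratelimit_delay() are hypothetical
names, time is passed in explicitly so the logic stays deterministic, and each
request is assumed to fit within one slice's quota.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-slice byte budget; not a QEMU structure. */
typedef struct {
    uint64_t slice_ns;      /* length of one accounting slice */
    uint64_t slice_quota;   /* bytes allowed per slice */
    uint64_t slice_end_ns;  /* end of the current slice */
    uint64_t dispatched;    /* bytes accounted in the current slice */
} ToyRateLimit;

/* Try to account n bytes at time now_ns. Returns 0 if the I/O may be
 * issued now, otherwise the number of nanoseconds to wait before
 * retrying. Assumes n <= slice_quota, else the request never fits. */
static uint64_t toy_ratelimit_delay(ToyRateLimit *rl, uint64_t now_ns,
                                    uint64_t n)
{
    if (now_ns >= rl->slice_end_ns) {
        /* The previous slice has expired: start a fresh one. */
        rl->slice_end_ns = now_ns + rl->slice_ns;
        rl->dispatched = 0;
    }
    if (rl->dispatched + n > rl->slice_quota) {
        /* Over budget: the caller sleeps until the slice ends. */
        return rl->slice_end_ns - now_ns;
    }
    rl->dispatched += n;
    return 0;
}
```

The streaming coroutine would call this before each copy and sleep for the
returned delay, which keeps the limiter itself independent of who is using it.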
Ideas or questions?
Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-23 15:57 [Qemu-devel] [RFC] Generic image streaming Stefan Hajnoczi
@ 2011-09-26 5:32 ` Zhi Yong Wu
2011-09-26 7:55 ` Stefan Hajnoczi
2011-09-26 12:35 ` Marcelo Tosatti
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Zhi Yong Wu @ 2011-09-26 5:32 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel, Zhi Yong Wu
On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> Here is my generic image streaming branch, which aims to provide a way
> to copy the contents of a backing file into an image file of a running
> guest without requiring specific support in the various block drivers
> (e.g. qcow2, qed, vmdk):
>
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>
> The tree does not provide full image streaming yet but I'd like to
> discuss the approach taken in the code. Here are the main points:
>
> The image streaming API is available through HMP and QMP commands. When
> streaming is started on a block device a coroutine is created to do the
> background I/O work. The coroutine can be cancelled.
>
> While the coroutine copies data from the backing file into the image
> file, the guest may be performing I/O to the image file. Guest reads do
> not conflict with streaming but guest writes require special handling.
> If the guest writes to a region of the image file that we are currently
> copying, then there is the potential to clobber the guest write with old
> data from the backing file.
>
> Previously I solved this in a QED-specific way by taking advantage of
> the serialization of allocating write requests. In order to do this
> generically we need to track in-flight requests and have the ability to
> queue I/O. Guest writes that affect an in-flight streaming copy
> operation must wait for that operation to complete before being issued.
> Streaming copy operations must skip overlapping regions of guest writes.
>
> One big difference to the QED image streaming implementation is that
> this generic implementation is not based on copy-on-read operations.
> Instead we do a sequence of bdrv_is_allocated() to find regions for
> streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> populate the image file.
>
> It turns out that generic copy-on-read is not an attractive operation
> because it requires using bounce buffers for every request. Kevin
bounce buffers == buffer ring?
> pointed out the case where a guest performs a read and pokes the data
> buffer before the read completes, copy-on-read would write out the
> modified memory into the image file unless we use a bounce buffer.
Can you elaborate on this?
>
> There are a few pieces missing in my tree, which have mostly been solved
> in other places and just need to be reused:
> 1. Arbitration between guest and streaming requests (this is the only
> real new thing).
> 2. Efficient zero handling (skip writing those regions or mark them as
> zero clusters).
> 3. Queuing/dependencies when arbitration decides a request must wait.
> I'm taking a look at reusing Zhi Yong's block queue.
> 4. Rate-limiting to ensure streaming I/O does not impact the guest.
> Already exists in the QED-specific patches, it may make sense to
> extract common code that both migration and the block layer can use.
>
> Ideas or questions?
>
> Stefan
>
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 5:32 ` Zhi Yong Wu
@ 2011-09-26 7:55 ` Stefan Hajnoczi
2011-09-26 9:11 ` Zhi Yong Wu
0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-26 7:55 UTC (permalink / raw)
To: Zhi Yong Wu
Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi, Zhi Yong Wu,
qemu-devel
On Mon, Sep 26, 2011 at 01:32:34PM +0800, Zhi Yong Wu wrote:
> On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
> > Here is my generic image streaming branch, which aims to provide a way
> > to copy the contents of a backing file into an image file of a running
> > guest without requiring specific support in the various block drivers
> > (e.g. qcow2, qed, vmdk):
> >
> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
> >
> > The tree does not provide full image streaming yet but I'd like to
> > discuss the approach taken in the code. Here are the main points:
> >
> > The image streaming API is available through HMP and QMP commands. When
> > streaming is started on a block device a coroutine is created to do the
> > background I/O work. The coroutine can be cancelled.
> >
> > While the coroutine copies data from the backing file into the image
> > file, the guest may be performing I/O to the image file. Guest reads do
> > not conflict with streaming but guest writes require special handling.
> > If the guest writes to a region of the image file that we are currently
> > copying, then there is the potential to clobber the guest write with old
> > data from the backing file.
> >
> > Previously I solved this in a QED-specific way by taking advantage of
> > the serialization of allocating write requests. In order to do this
> > generically we need to track in-flight requests and have the ability to
> > queue I/O. Guest writes that affect an in-flight streaming copy
> > operation must wait for that operation to complete before being issued.
> > Streaming copy operations must skip overlapping regions of guest writes.
> >
> > One big difference to the QED image streaming implementation is that
> > this generic implementation is not based on copy-on-read operations.
> > Instead we do a sequence of bdrv_is_allocated() to find regions for
> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> > populate the image file.
> >
> > It turns out that generic copy-on-read is not an attractive operation
> > because it requires using bounce buffers for every request. Kevin
> bounce buffers == buffer ring?
A bounce buffer is a temporary buffer that is used because the actual
data buffer is not addressable or cannot be directly accessed for some
other reason. In this case it's because the guest should see read
semantics and not find that writes to its read data buffer result in
writes to disk.
> > pointed out the case where a guest performs a read and pokes the data
> > buffer before the read completes, copy-on-read would write out the
> > modified memory into the image file unless we use a bounce buffer.
> Can you elaborate on this?
1. Guest issues a read request.
2. QEMU issues host read request as first step in copy-on-read.
3. Host read request completes...
4. Guest overwrites its data buffer before QEMU acknowledges request
completion.
5. ...QEMU issues host write request.
6. Host completes write request and QEMU acknowledges guest read
completion.
What happened is that we populated the image file with data from guest
memory that does not match what is in the backing file. The guest
issued a read request, this should never result in writing to the image
file.
Although legitimate guests do not do this, a buggy guest could corrupt
its disk in this way!
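
The six steps above can be modelled with plain memory copies. This is a toy
sketch, not QEMU code: cor_direct(), cor_bounce(), scribble() and the GuestHook
callback are hypothetical names, and the callback simply stands in for a guest
that modifies its buffer at step 4.

```c
#include <assert.h>
#include <string.h>

enum { SECTOR = 8 };  /* made-up request size */

/* Stands in for the guest touching its buffer while the request is
 * still in flight (step 4). */
typedef void (*GuestHook)(unsigned char *guest_buf);

/* Copy-on-read without a bounce buffer: the image file is populated
 * from guest memory, so a guest modification leaks to disk. */
static void cor_direct(const unsigned char *backing, unsigned char *image,
                       unsigned char *guest_buf, GuestHook racing_guest)
{
    memcpy(guest_buf, backing, SECTOR);  /* steps 2-3: host read */
    if (racing_guest) {
        racing_guest(guest_buf);         /* step 4: guest overwrites */
    }
    memcpy(image, guest_buf, SECTOR);    /* steps 5-6: host write */
}

/* Copy-on-read with a bounce buffer: the image file is populated from
 * memory that the guest cannot reach. */
static void cor_bounce(const unsigned char *backing, unsigned char *image,
                       unsigned char *guest_buf, GuestHook racing_guest)
{
    unsigned char bounce[SECTOR];

    memcpy(bounce, backing, SECTOR);     /* host read into private buffer */
    memcpy(guest_buf, bounce, SECTOR);   /* extra copy for the guest */
    if (racing_guest) {
        racing_guest(guest_buf);         /* guest can only hurt itself */
    }
    memcpy(image, bounce, SECTOR);       /* host write from private buffer */
}

/* Example of a misbehaving guest. */
static void scribble(unsigned char *guest_buf)
{
    memset(guest_buf, 'X', SECTOR);
}
```

With scribble() as the hook, cor_direct() leaves the image differing from the
backing file while cor_bounce() does not. The price is the extra memcpy() on
every request.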
Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 7:55 ` Stefan Hajnoczi
@ 2011-09-26 9:11 ` Zhi Yong Wu
2011-09-26 9:30 ` Stefan Hajnoczi
0 siblings, 1 reply; 12+ messages in thread
From: Zhi Yong Wu @ 2011-09-26 9:11 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi, Zhi Yong Wu,
qemu-devel
On Mon, Sep 26, 2011 at 3:55 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 01:32:34PM +0800, Zhi Yong Wu wrote:
>> On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
>> <stefanha@linux.vnet.ibm.com> wrote:
>> > Here is my generic image streaming branch, which aims to provide a way
>> > to copy the contents of a backing file into an image file of a running
>> > guest without requiring specific support in the various block drivers
>> > (e.g. qcow2, qed, vmdk):
>> >
>> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>> >
>> > The tree does not provide full image streaming yet but I'd like to
>> > discuss the approach taken in the code. Here are the main points:
>> >
>> > The image streaming API is available through HMP and QMP commands. When
>> > streaming is started on a block device a coroutine is created to do the
>> > background I/O work. The coroutine can be cancelled.
>> >
>> > While the coroutine copies data from the backing file into the image
>> > file, the guest may be performing I/O to the image file. Guest reads do
>> > not conflict with streaming but guest writes require special handling.
>> > If the guest writes to a region of the image file that we are currently
>> > copying, then there is the potential to clobber the guest write with old
>> > data from the backing file.
>> >
>> > Previously I solved this in a QED-specific way by taking advantage of
>> > the serialization of allocating write requests. In order to do this
>> > generically we need to track in-flight requests and have the ability to
>> > queue I/O. Guest writes that affect an in-flight streaming copy
>> > operation must wait for that operation to complete before being issued.
>> > Streaming copy operations must skip overlapping regions of guest writes.
>> >
>> > One big difference to the QED image streaming implementation is that
>> > this generic implementation is not based on copy-on-read operations.
>> > Instead we do a sequence of bdrv_is_allocated() to find regions for
>> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
>> > populate the image file.
>> >
>> > It turns out that generic copy-on-read is not an attractive operation
>> > because it requires using bounce buffers for every request. Kevin
>> bounce buffers == buffer ring?
>
> A bounce buffer is a temporary buffer that is used because the actual
> data buffer is not addressable or cannot be directly accessed for some
> other reason. In this case it's because the guest should see read
> semantics and not find that writes to its read data buffer result in
> writes to disk.
>
>> > pointed out the case where a guest performs a read and pokes the data
>> > buffer before the read completes, copy-on-read would write out the
>> > modified memory into the image file unless we use a bounce buffer.
Sorry, to be honest, I don't know which scenario would cause
guest-modified memory to be written out into the image file.
>> Can you elaborate on this?
>
> 1. Guest issues a read request.
> 2. QEMU issues host read request as first step in copy-on-read.
> 3. Host read request completes...
> 4. Guest overwrites its data buffer before QEMU acknowledges request
> completion.
> 5. ...QEMU issues host write request.
> 6. Host completes write request and QEMU acknowledges guest read
> completion.
Good, thanks.
>
> What happened is that we populated the image file with data from guest
> memory that does not match what is in the backing file. The guest
How do we find out that the two don't match?
> issued a read request, this should never result in writing to the image
> file.
>
> Although legitimate guests do not do this, a buggy guest could corrupt
> its disk in this way!
>
> Stefan
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 9:11 ` Zhi Yong Wu
@ 2011-09-26 9:30 ` Stefan Hajnoczi
2011-09-26 14:18 ` Zhi Yong Wu
0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-26 9:30 UTC (permalink / raw)
To: Zhi Yong Wu
Cc: Kevin Wolf, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel,
Zhi Yong Wu
On Mon, Sep 26, 2011 at 05:11:00PM +0800, Zhi Yong Wu wrote:
> On Mon, Sep 26, 2011 at 3:55 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Mon, Sep 26, 2011 at 01:32:34PM +0800, Zhi Yong Wu wrote:
> >> On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
> >> <stefanha@linux.vnet.ibm.com> wrote:
> >> > Here is my generic image streaming branch, which aims to provide a way
> >> > to copy the contents of a backing file into an image file of a running
> >> > guest without requiring specific support in the various block drivers
> >> > (e.g. qcow2, qed, vmdk):
> >> >
> >> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
> >> >
> >> > The tree does not provide full image streaming yet but I'd like to
> >> > discuss the approach taken in the code. Here are the main points:
> >> >
> >> > The image streaming API is available through HMP and QMP commands. When
> >> > streaming is started on a block device a coroutine is created to do the
> >> > background I/O work. The coroutine can be cancelled.
> >> >
> >> > While the coroutine copies data from the backing file into the image
> >> > file, the guest may be performing I/O to the image file. Guest reads do
> >> > not conflict with streaming but guest writes require special handling.
> >> > If the guest writes to a region of the image file that we are currently
> >> > copying, then there is the potential to clobber the guest write with old
> >> > data from the backing file.
> >> >
> >> > Previously I solved this in a QED-specific way by taking advantage of
> >> > the serialization of allocating write requests. In order to do this
> >> > generically we need to track in-flight requests and have the ability to
> >> > queue I/O. Guest writes that affect an in-flight streaming copy
> >> > operation must wait for that operation to complete before being issued.
> >> > Streaming copy operations must skip overlapping regions of guest writes.
> >> >
> >> > One big difference to the QED image streaming implementation is that
> >> > this generic implementation is not based on copy-on-read operations.
> >> > Instead we do a sequence of bdrv_is_allocated() to find regions for
> >> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> >> > populate the image file.
> >> >
> >> > It turns out that generic copy-on-read is not an attractive operation
> >> > because it requires using bounce buffers for every request. Kevin
> >> bounce buffers == buffer ring?
> >
> > A bounce buffer is a temporary buffer that is used because the actual
> > data buffer is not addressable or cannot be directly accessed for some
> > other reason. In this case it's because the guest should see read
> > semantics and not find that writes to its read data buffer result in
> > writes to disk.
> >
> >> > pointed out the case where a guest performs a read and pokes the data
> >> > buffer before the read completes, copy-on-read would write out the
> >> > modified memory into the image file unless we use a bounce buffer.
> Sorry, to be honest, I don't know which scenario would cause
> guest-modified memory to be written out into the image file.
I showed the scenario in the steps posted below:
> >> Can you elaborate on this?
> >
> > 1. Guest issues a read request.
> > 2. QEMU issues host read request as first step in copy-on-read.
> > 3. Host read request completes...
> > 4. Guest overwrites its data buffer before QEMU acknowledges request
> > completion.
> > 5. ...QEMU issues host write request.
> > 6. Host completes write request and QEMU acknowledges guest read
> > completion.
> Good, thanks.
> >
> > What happened is that we populated the image file with data from guest
> > memory that does not match what is in the backing file. The guest
> How do we find out that the two don't match?
Reread what I posted and think about the case where a QEMU read buffer
(the "bounce buffer") is used in step 2. In that case the guest cannot
tamper with the data buffer while performing copy-on-read.
Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-23 15:57 [Qemu-devel] [RFC] Generic image streaming Stefan Hajnoczi
2011-09-26 5:32 ` Zhi Yong Wu
@ 2011-09-26 12:35 ` Marcelo Tosatti
2011-09-26 14:21 ` Stefan Hajnoczi
2011-09-27 3:26 ` Zhi Yong Wu
2011-09-27 9:05 ` Zhi Yong Wu
3 siblings, 1 reply; 12+ messages in thread
From: Marcelo Tosatti @ 2011-09-26 12:35 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, Zhi Yong Wu
On Fri, Sep 23, 2011 at 04:57:26PM +0100, Stefan Hajnoczi wrote:
> Here is my generic image streaming branch, which aims to provide a way
> to copy the contents of a backing file into an image file of a running
> guest without requiring specific support in the various block drivers
> (e.g. qcow2, qed, vmdk):
>
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>
> The tree does not provide full image streaming yet but I'd like to
> discuss the approach taken in the code. Here are the main points:
>
> The image streaming API is available through HMP and QMP commands. When
> streaming is started on a block device a coroutine is created to do the
> background I/O work. The coroutine can be cancelled.
>
> While the coroutine copies data from the backing file into the image
> file, the guest may be performing I/O to the image file. Guest reads do
> not conflict with streaming but guest writes require special handling.
> If the guest writes to a region of the image file that we are currently
> copying, then there is the potential to clobber the guest write with old
> data from the backing file.
>
> Previously I solved this in a QED-specific way by taking advantage of
> the serialization of allocating write requests. In order to do this
> generically we need to track in-flight requests and have the ability to
> queue I/O. Guest writes that affect an in-flight streaming copy
> operation must wait for that operation to complete before being issued.
> Streaming copy operations must skip overlapping regions of guest writes.
>
> One big difference to the QED image streaming implementation is that
> this generic implementation is not based on copy-on-read operations.
> Instead we do a sequence of bdrv_is_allocated() to find regions for
> streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> populate the image file.
>
> It turns out that generic copy-on-read is not an attractive operation
> because it requires using bounce buffers for every request.
Isn't COR essential for decent read performance in the
image-stream-from-slow-remote-origin case?
> Kevin pointed out the case where a guest performs a read and pokes the
> data buffer before the read completes, copy-on-read would write out
> the modified memory into the image file unless we use a bounce buffer.
Either wait for the write originating from a COR to finish before
exposing the read to the guest, or have a bounce buffer.
> There are a few pieces missing in my tree, which have mostly been solved
> in other places and just need to be reused:
> 1. Arbitration between guest and streaming requests (this is the only
> real new thing).
> 2. Efficient zero handling (skip writing those regions or mark them as
> zero clusters).
> 3. Queuing/dependencies when arbitration decides a request must wait.
> I'm taking a look at reusing Zhi Yong's block queue.
> 4. Rate-limiting to ensure streaming I/O does not impact the guest.
> Already exists in the QED-specific patches, it may make sense to
> extract common code that both migration and the block layer can use.
>
> Ideas or questions?
>
> Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 9:30 ` Stefan Hajnoczi
@ 2011-09-26 14:18 ` Zhi Yong Wu
0 siblings, 0 replies; 12+ messages in thread
From: Zhi Yong Wu @ 2011-09-26 14:18 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Stefan Hajnoczi, Marcelo Tosatti, qemu-devel,
Zhi Yong Wu
On Mon, Sep 26, 2011 at 5:30 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Mon, Sep 26, 2011 at 05:11:00PM +0800, Zhi Yong Wu wrote:
>> On Mon, Sep 26, 2011 at 3:55 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Mon, Sep 26, 2011 at 01:32:34PM +0800, Zhi Yong Wu wrote:
>> >> On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
>> >> <stefanha@linux.vnet.ibm.com> wrote:
>> >> > Here is my generic image streaming branch, which aims to provide a way
>> >> > to copy the contents of a backing file into an image file of a running
>> >> > guest without requiring specific support in the various block drivers
>> >> > (e.g. qcow2, qed, vmdk):
>> >> >
>> >> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>> >> >
>> >> > The tree does not provide full image streaming yet but I'd like to
>> >> > discuss the approach taken in the code. Here are the main points:
>> >> >
>> >> > The image streaming API is available through HMP and QMP commands. When
>> >> > streaming is started on a block device a coroutine is created to do the
>> >> > background I/O work. The coroutine can be cancelled.
>> >> >
>> >> > While the coroutine copies data from the backing file into the image
>> >> > file, the guest may be performing I/O to the image file. Guest reads do
>> >> > not conflict with streaming but guest writes require special handling.
>> >> > If the guest writes to a region of the image file that we are currently
>> >> > copying, then there is the potential to clobber the guest write with old
>> >> > data from the backing file.
>> >> >
>> >> > Previously I solved this in a QED-specific way by taking advantage of
>> >> > the serialization of allocating write requests. In order to do this
>> >> > generically we need to track in-flight requests and have the ability to
>> >> > queue I/O. Guest writes that affect an in-flight streaming copy
>> >> > operation must wait for that operation to complete before being issued.
>> >> > Streaming copy operations must skip overlapping regions of guest writes.
>> >> >
>> >> > One big difference to the QED image streaming implementation is that
>> >> > this generic implementation is not based on copy-on-read operations.
>> >> > Instead we do a sequence of bdrv_is_allocated() to find regions for
>> >> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
>> >> > populate the image file.
>> >> >
>> >> > It turns out that generic copy-on-read is not an attractive operation
>> >> > because it requires using bounce buffers for every request. Kevin
>> >> bounce buffers == buffer ring?
>> >
>> > A bounce buffer is a temporary buffer that is used because the actual
>> > data buffer is not addressable or cannot be directly accessed for some
>> > other reason. In this case it's because the guest should see read
>> > semantics and not find that writes to its read data buffer result in
>> > writes to disk.
>> >
>> >> > pointed out the case where a guest performs a read and pokes the data
>> >> > buffer before the read completes, copy-on-read would write out the
>> >> > modified memory into the image file unless we use a bounce buffer.
>> Sorry, to be honest, I don't know which scenario would cause
>> guest-modified memory to be written out into the image file.
>
> I showed the scenario in the steps posted below:
>
>> >> Can you elaborate on this?
>> >
>> > 1. Guest issues a read request.
>> > 2. QEMU issues host read request as first step in copy-on-read.
>> > 3. Host read request completes...
>> > 4. Guest overwrites its data buffer before QEMU acknowledges request
>> > completion.
>> > 5. ...QEMU issues host write request.
>> > 6. Host completes write request and QEMU acknowledges guest read
>> > completion.
>> Good, thanks.
>> >
>> > What happened is that we populated the image file with data from guest
>> > memory that does not match what is in the backing file. The guest
>> How do we find out that the two don't match?
>
> Reread what I posted and think about the case where a QEMU read buffer
> (the "bounce buffer") is used in step 2. In that case the guest cannot
> tamper with the data buffer while performing copy-on-read.
Got it now, thanks.
>
> Stefan
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 12:35 ` Marcelo Tosatti
@ 2011-09-26 14:21 ` Stefan Hajnoczi
2011-09-27 12:07 ` Stefan Hajnoczi
0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-26 14:21 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Kevin Wolf, qemu-devel, Zhi Yong Wu
On Mon, Sep 26, 2011 at 09:35:01AM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 23, 2011 at 04:57:26PM +0100, Stefan Hajnoczi wrote:
> > Here is my generic image streaming branch, which aims to provide a way
> > to copy the contents of a backing file into an image file of a running
> > guest without requiring specific support in the various block drivers
> > (e.g. qcow2, qed, vmdk):
> >
> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
> >
> > The tree does not provide full image streaming yet but I'd like to
> > discuss the approach taken in the code. Here are the main points:
> >
> > The image streaming API is available through HMP and QMP commands. When
> > streaming is started on a block device a coroutine is created to do the
> > background I/O work. The coroutine can be cancelled.
> >
> > While the coroutine copies data from the backing file into the image
> > file, the guest may be performing I/O to the image file. Guest reads do
> > not conflict with streaming but guest writes require special handling.
> > If the guest writes to a region of the image file that we are currently
> > copying, then there is the potential to clobber the guest write with old
> > data from the backing file.
> >
> > Previously I solved this in a QED-specific way by taking advantage of
> > the serialization of allocating write requests. In order to do this
> > generically we need to track in-flight requests and have the ability to
> > queue I/O. Guest writes that affect an in-flight streaming copy
> > operation must wait for that operation to complete before being issued.
> > Streaming copy operations must skip overlapping regions of guest writes.
> >
> > One big difference to the QED image streaming implementation is that
> > this generic implementation is not based on copy-on-read operations.
> > Instead we do a sequence of bdrv_is_allocated() to find regions for
> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> > populate the image file.
> >
> > It turns out that generic copy-on-read is not an attractive operation
> > because it requires using bounce buffers for every request.
>
> Isn't COR essential for decent read performance in the
> image-stream-from-slow-remote-origin case?
It is essential for re-read performance from a slow backing file. With
images over internet HTTP it most definitely is worth doing
copy-on-read.
In the case of an NFS server the performance depends on the network and
server. It might be similar speed or faster to read from NFS.
I will think some more about how to implement generic copy-on-read.
> > Kevin pointed out the case where a guest performs a read and pokes the
> > data buffer before the read completes, copy-on-read would write out
> > the modified memory into the image file unless we use a bounce buffer.
>
> Either wait for the write originating from a COR to finish before
> exposing the read to the guest, or have a bounce buffer.
When the guest issues a read we try to read directly into its data
buffer. The problem is that it is not okay to write out that buffer
from guest memory because it may have been scribbled on by the guest.
The guest can do this even before we notify it of read completion. So
waiting for the write to complete does not solve the problem,
unfortunately.
Although sane guests will not scribble over data buffers we cannot allow
QEMU to turn a memory corruption inside the guest into a data corruption
of the disk image.
Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-23 15:57 [Qemu-devel] [RFC] Generic image streaming Stefan Hajnoczi
2011-09-26 5:32 ` Zhi Yong Wu
2011-09-26 12:35 ` Marcelo Tosatti
@ 2011-09-27 3:26 ` Zhi Yong Wu
2011-09-27 8:37 ` Stefan Hajnoczi
2011-09-27 9:05 ` Zhi Yong Wu
3 siblings, 1 reply; 12+ messages in thread
From: Zhi Yong Wu @ 2011-09-27 3:26 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel, Zhi Yong Wu
On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> Here is my generic image streaming branch, which aims to provide a way
> to copy the contents of a backing file into an image file of a running
> guest without requiring specific support in the various block drivers
> (e.g. qcow2, qed, vmdk):
>
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>
> The tree does not provide full image streaming yet but I'd like to
> discuss the approach taken in the code. Here are the main points:
>
> The image streaming API is available through HMP and QMP commands. When
> streaming is started on a block device a coroutine is created to do the
> background I/O work. The coroutine can be cancelled.
>
> While the coroutine copies data from the backing file into the image
> file, the guest may be performing I/O to the image file. Guest reads do
> not conflict with streaming but guest writes require special handling.
> If the guest writes to a region of the image file that we are currently
> copying, then there is the potential to clobber the guest write with old
> data from the backing file.
>
> Previously I solved this in a QED-specific way by taking advantage of
> the serialization of allocating write requests. In order to do this
> generically we need to track in-flight requests and have the ability to
> queue I/O. Guest writes that affect an in-flight streaming copy
> operation must wait for that operation to complete before being issued.
> Streaming copy operations must skip overlapping regions of guest writes.
>
> One big difference to the QED image streaming implementation is that
> this generic implementation is not based on copy-on-read operations.
> Instead we do a sequence of bdrv_is_allocated() to find regions for
> streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
Why is the api not bdrv_aio_readv/writev? In your branch, it seems
that you only modify bdrv_read/write. Does your branch currently only
support sync read/write mode?
> populate the image file.
>
> It turns out that generic copy-on-read is not an attractive operation
> because it requires using bounce buffers for every request. Kevin
> pointed out the case where a guest performs a read and pokes the data
> buffer before the read completes, copy-on-read would write out the
> modified memory into the image file unless we use a bounce buffer.
>
> There are a few pieces missing in my tree, which have mostly been solved
> in other places and just need to be reused:
> 1. Arbitration between guest and streaming requests (this is the only
> real new thing).
> 2. Efficient zero handling (skip writing those regions or mark them as
> zero clusters).
> 3. Queuing/dependencies when arbitration decides a request must wait.
> I'm taking a look at reusing Zhi Yong's block queue.
> 4. Rate-limiting to ensure streaming I/O does not impact the guest.
> Already exists in the QED-specific patches, it may make sense to
> extract common code that both migration and the block layer can use.
>
> Ideas or questions?
>
> Stefan
>
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-27 3:26 ` Zhi Yong Wu
@ 2011-09-27 8:37 ` Stefan Hajnoczi
0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-27 8:37 UTC (permalink / raw)
To: Zhi Yong Wu; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel, Zhi Yong Wu
On Tue, Sep 27, 2011 at 11:26:45AM +0800, Zhi Yong Wu wrote:
> On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
> > Here is my generic image streaming branch, which aims to provide a way
> > to copy the contents of a backing file into an image file of a running
> > guest without requiring specific support in the various block drivers
> > (e.g. qcow2, qed, vmdk):
> >
> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
> >
> > The tree does not provide full image streaming yet but I'd like to
> > discuss the approach taken in the code. Here are the main points:
> >
> > The image streaming API is available through HMP and QMP commands. When
> > streaming is started on a block device a coroutine is created to do the
> > background I/O work. The coroutine can be cancelled.
> >
> > While the coroutine copies data from the backing file into the image
> > file, the guest may be performing I/O to the image file. Guest reads do
> > not conflict with streaming but guest writes require special handling.
> > If the guest writes to a region of the image file that we are currently
> > copying, then there is the potential to clobber the guest write with old
> > data from the backing file.
> >
> > Previously I solved this in a QED-specific way by taking advantage of
> > the serialization of allocating write requests. In order to do this
> > generically we need to track in-flight requests and have the ability to
> > queue I/O. Guest writes that affect an in-flight streaming copy
> > operation must wait for that operation to complete before being issued.
> > Streaming copy operations must skip overlapping regions of guest writes.
> >
> > One big difference to the QED image streaming implementation is that
> > this generic implementation is not based on copy-on-read operations.
> > Instead we do a sequence of bdrv_is_allocated() to find regions for
> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> Why is the api not bdrv_aio_readv/writev? In your branch, it seems
> that you only modify bdrv_read/write. Does your branch currently only
> support sync read/write mode?
No, it is designed to work with all three: aio, co, and sync.
The streaming loop itself is in a coroutine but it will be able to
interact with any other block I/O requests. The critical part is the
request tracking here, which monitors aio, co, and sync:
http://repo.or.cz/w/qemu/stefanha.git/commitdiff/cf365d4e33ba19fadd4f1f20c64526e890d34239
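As a rough illustration of what such request tracking might look like,
here is a self-contained sketch in plain C. It is not the code from the
commit above: TrackedRequest, RequestTracker and the tracker_*
functions are hypothetical names, and the list is a simple singly
linked list. The idea is that every request, whatever its origin, is
registered while in flight so that a new request can check whether it
overlaps an existing one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one in-flight request. */
typedef struct TrackedRequest {
    int64_t sector_num;            /* first sector touched */
    int nb_sectors;                /* length in sectors */
    bool is_write;
    struct TrackedRequest *next;   /* next in-flight request */
} TrackedRequest;

typedef struct {
    TrackedRequest *head;          /* list of in-flight requests */
} RequestTracker;

/* Register a request before it is issued. */
static void tracker_begin(RequestTracker *t, TrackedRequest *req)
{
    assert(req->nb_sectors > 0);
    req->next = t->head;
    t->head = req;
}

/* Unregister a request once it has completed. */
static void tracker_end(RequestTracker *t, TrackedRequest *req)
{
    TrackedRequest **p;

    for (p = &t->head; *p; p = &(*p)->next) {
        if (*p == req) {
            *p = req->next;
            req->next = NULL;
            return;
        }
    }
}

/* Return an in-flight request overlapping [sector_num, sector_num +
 * nb_sectors), or NULL if the range is free. */
static TrackedRequest *tracker_find_overlap(const RequestTracker *t,
                                            int64_t sector_num,
                                            int nb_sectors)
{
    TrackedRequest *req;

    for (req = t->head; req; req = req->next) {
        if (sector_num < req->sector_num + req->nb_sectors &&
            req->sector_num < sector_num + nb_sectors) {
            return req;
        }
    }
    return NULL;
}
```

A guest write would call tracker_find_overlap() first and wait if it
hits an in-flight streaming copy; how the waiting is done (queue,
coroutine yield) is left out of this sketch.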
Stefan
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-23 15:57 [Qemu-devel] [RFC] Generic image streaming Stefan Hajnoczi
` (2 preceding siblings ...)
2011-09-27 3:26 ` Zhi Yong Wu
@ 2011-09-27 9:05 ` Zhi Yong Wu
3 siblings, 0 replies; 12+ messages in thread
From: Zhi Yong Wu @ 2011-09-27 9:05 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel, Zhi Yong Wu
On Fri, Sep 23, 2011 at 11:57 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> Here is my generic image streaming branch, which aims to provide a way
> to copy the contents of a backing file into an image file of a running
> guest without requiring specific support in the various block drivers
> (e.g. qcow2, qed, vmdk):
>
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
Sorry, I missed this support logic. Thanks.
>
> The tree does not provide full image streaming yet but I'd like to
> discuss the approach taken in the code. Here are the main points:
>
> The image streaming API is available through HMP and QMP commands. When
> streaming is started on a block device a coroutine is created to do the
> background I/O work. The coroutine can be cancelled.
>
> While the coroutine copies data from the backing file into the image
> file, the guest may be performing I/O to the image file. Guest reads do
> not conflict with streaming but guest writes require special handling.
> If the guest writes to a region of the image file that we are currently
> copying, then there is the potential to clobber the guest write with old
> data from the backing file.
>
> Previously I solved this in a QED-specific way by taking advantage of
> the serialization of allocating write requests. In order to do this
> generically we need to track in-flight requests and have the ability to
> queue I/O. Guest writes that affect an in-flight streaming copy
> operation must wait for that operation to complete before being issued.
> Streaming copy operations must skip overlapping regions of guest writes.
>
> One big difference to the QED image streaming implementation is that
> this generic implementation is not based on copy-on-read operations.
> Instead we do a sequence of bdrv_is_allocated() to find regions for
> streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
> populate the image file.
>
> It turns out that generic copy-on-read is not an attractive operation
> because it requires using bounce buffers for every request. Kevin
> pointed out the case where a guest performs a read and pokes the data
> buffer before the read completes, copy-on-read would write out the
> modified memory into the image file unless we use a bounce buffer.
>
> There are a few pieces missing in my tree, which have mostly been solved
> in other places and just need to be reused:
> 1. Arbitration between guest and streaming requests (this is the only
> real new thing).
> 2. Efficient zero handling (skip writing those regions or mark them as
> zero clusters).
> 3. Queuing/dependencies when arbitration decides a request must wait.
> I'm taking a look at reusing Zhi Yong's block queue.
> 4. Rate-limiting to ensure streaming I/O does not impact the guest.
> Already exists in the QED-specific patches, it may make sense to
> extract common code that both migration and the block layer can use.
>
> Ideas or questions?
>
> Stefan
>
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [RFC] Generic image streaming
2011-09-26 14:21 ` Stefan Hajnoczi
@ 2011-09-27 12:07 ` Stefan Hajnoczi
0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-09-27 12:07 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Zhi Yong Wu
On Mon, Sep 26, 2011 at 3:21 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Mon, Sep 26, 2011 at 09:35:01AM -0300, Marcelo Tosatti wrote:
>> On Fri, Sep 23, 2011 at 04:57:26PM +0100, Stefan Hajnoczi wrote:
>> > Here is my generic image streaming branch, which aims to provide a way
>> > to copy the contents of a backing file into an image file of a running
>> > guest without requiring specific support in the various block drivers
>> > (e.g. qcow2, qed, vmdk):
>> >
>> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>> >
>> > The tree does not provide full image streaming yet but I'd like to
>> > discuss the approach taken in the code. Here are the main points:
>> >
>> > The image streaming API is available through HMP and QMP commands. When
>> > streaming is started on a block device a coroutine is created to do the
>> > background I/O work. The coroutine can be cancelled.
>> >
>> > While the coroutine copies data from the backing file into the image
>> > file, the guest may be performing I/O to the image file. Guest reads do
>> > not conflict with streaming but guest writes require special handling.
>> > If the guest writes to a region of the image file that we are currently
>> > copying, then there is the potential to clobber the guest write with old
>> > data from the backing file.
>> >
>> > Previously I solved this in a QED-specific way by taking advantage of
>> > the serialization of allocating write requests. In order to do this
>> > generically we need to track in-flight requests and have the ability to
>> > queue I/O. Guest writes that affect an in-flight streaming copy
>> > operation must wait for that operation to complete before being issued.
>> > Streaming copy operations must skip overlapping regions of guest writes.
>> >
>> > One big difference to the QED image streaming implementation is that
>> > this generic implementation is not based on copy-on-read operations.
>> > Instead we do a sequence of bdrv_is_allocated() to find regions for
>> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
>> > populate the image file.
>> >
>> > It turns out that generic copy-on-read is not an attractive operation
>> > because it requires using bounce buffers for every request.
>>
>> Isnt COR essential for a decent read performance on the
>> image-stream-from-slow-remote-origin case?
>
> It is essential for re-read performance from a slow backing file. With
> images over internet HTTP it most definitely is worth doing
> copy-on-read.
>
> In the case of an NFS server the performance depends on the network and
> server. It might be similar speed or faster to read from NFS.
>
> I will think some more about how to implement generic copy-on-read.
I've sketched out how generic copy-on-read can work. It's probably
not much extra effort since we need request tracking and the ability
to queue/hold requests anyway.
I hope to have patches implementing this by the end of the week:
1. When CoR is enabled, overlapping requests get queued so that only
one is actually being issued to the host at a time. This prevents
race conditions where a guest write request is clobbered by a
copy-on-read. Note that only overlapping requests are queued,
non-overlapping requests proceed in parallel.
2. The read operation uses bdrv_is_allocated() first to see whether a
copy-on-read needs to be performed or if we can go down the fast path.
The fast path is the normal read straight into the guest buffer. The
copy-on-read path reads into a bounce buffer, writes into the image
file, and then copies the bounce buffer into the guest buffer.
3. The .bdrv_is_allocated() implementations will be audited and
improved to make them aio/coroutine-friendly where necessary.
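The read path in point 2 can be sketched as a toy model in plain C.
This is only an illustration under stated assumptions: ToyImage and
toy_cor_read are hypothetical, the allocated[] array stands in for
bdrv_is_allocated(), everything is in memory and synchronous, and the
queuing of overlapping requests from point 1 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 4096
#define NB_CLUSTERS  4

/* Toy model of an image file on top of a backing file. */
typedef struct {
    uint8_t backing[NB_CLUSTERS][CLUSTER_SIZE];
    uint8_t image[NB_CLUSTERS][CLUSTER_SIZE];
    bool allocated[NB_CLUSTERS];  /* stands in for bdrv_is_allocated() */
    int cor_count;                /* copy-on-read operations performed */
} ToyImage;

/* Read one cluster into guest_buf, populating the image on first use. */
static void toy_cor_read(ToyImage *img, int cluster, uint8_t *guest_buf)
{
    uint8_t bounce[CLUSTER_SIZE];

    assert(cluster >= 0 && cluster < NB_CLUSTERS);

    if (img->allocated[cluster]) {
        /* Fast path: data is already in the image file, read it
         * straight into the guest buffer. */
        memcpy(guest_buf, img->image[cluster], CLUSTER_SIZE);
        return;
    }

    /* Copy-on-read path: backing file -> bounce buffer -> image file,
     * then bounce buffer -> guest buffer. */
    memcpy(bounce, img->backing[cluster], CLUSTER_SIZE);
    memcpy(img->image[cluster], bounce, CLUSTER_SIZE);
    img->allocated[cluster] = true;
    img->cor_count++;
    memcpy(guest_buf, bounce, CLUSTER_SIZE);
}
```

The first read of a cluster takes the copy-on-read path and marks it
allocated; later reads of the same cluster take the fast path and no
longer touch the backing file.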
Stefan