* [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
` (6 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
Add macros for aligning a number to a multiple, for example:
QEMU_ALIGN_DOWN(500, 2000) = 0
QEMU_ALIGN_UP(500, 2000) = 2000
Since ALIGN_UP() is a common macro name, use the QEMU_* namespace prefix.
Hopefully this will protect us from included headers that leak something
with a similar name.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
qemu-common.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/qemu-common.h b/qemu-common.h
index 2ce47aa..44870fe 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -341,6 +341,12 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
return res.ll;
}
+/* Round number down to multiple */
+#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
+
+/* Round number up to multiple */
+#define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
+
#include "module.h"
#endif
--
1.7.7.1
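As a quick standalone sanity check (not part of the patch), the two macros
can be exercised with the values from the commit message:

    #include <stdio.h>

    #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
    #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))

    int main(void)
    {
        printf("%d\n", QEMU_ALIGN_DOWN(500, 2000)); /* 0 */
        printf("%d\n", QEMU_ALIGN_UP(500, 2000));   /* 2000 */
        printf("%d\n", QEMU_ALIGN_UP(4096, 512));   /* 4096, already aligned */
        return 0;
    }

Both macros assume m > 0 and that n + m - 1 does not overflow the operand
type.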
* [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all()
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
` (5 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
It's common to wake up all waiting coroutines. Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block/qcow2.c | 2 +-
qemu-coroutine-lock.c | 15 ++++++++-------
qemu-coroutine.h | 5 +++++
3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/block/qcow2.c b/block/qcow2.c
index eab35d1..195e1b1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -514,7 +514,7 @@ static void run_dependent_requests(BDRVQcowState *s, QCowL2Meta *m)
/* Restart all dependent requests */
if (!qemu_co_queue_empty(&m->dependent_requests)) {
qemu_co_mutex_unlock(&s->lock);
- while(qemu_co_queue_next(&m->dependent_requests));
+ qemu_co_queue_restart_all(&m->dependent_requests);
qemu_co_mutex_lock(&s->lock);
}
}
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index 9549c07..26ad76b 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -84,6 +84,13 @@ bool qemu_co_queue_next(CoQueue *queue)
return (next != NULL);
}
+void qemu_co_queue_restart_all(CoQueue *queue)
+{
+ while (qemu_co_queue_next(queue)) {
+ /* Do nothing */
+ }
+}
+
bool qemu_co_queue_empty(CoQueue *queue)
{
return (QTAILQ_FIRST(&queue->entries) == NULL);
@@ -144,13 +151,7 @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
assert(qemu_in_coroutine());
if (lock->writer) {
lock->writer = false;
- while (!qemu_co_queue_empty(&lock->queue)) {
- /*
- * Wakeup every body. This will include some
- * writers too.
- */
- qemu_co_queue_next(&lock->queue);
- }
+ qemu_co_queue_restart_all(&lock->queue);
} else {
lock->reader--;
assert(lock->reader >= 0);
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
index 8a2e5d2..8a55fe1 100644
--- a/qemu-coroutine.h
+++ b/qemu-coroutine.h
@@ -131,6 +131,11 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue);
bool qemu_co_queue_next(CoQueue *queue);
/**
+ * Restarts all coroutines in the CoQueue and leaves the queue empty.
+ */
+void qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
* Checks if the CoQueue is empty.
*/
bool qemu_co_queue_empty(CoQueue *queue);
--
1.7.7.1
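For readers unfamiliar with the CoQueue API, here is a rough usage sketch.
The MyState type and the my_*/make_* helpers are made-up names for
illustration; only the qemu_co_queue_*() calls are the real API:

    #include "qemu-coroutine.h"

    /* Hypothetical example type; assumes qemu_co_queue_init(&s->waiters)
     * ran when the state was created. */
    typedef struct {
        CoQueue waiters;
    } MyState;

    static void coroutine_fn consumer(MyState *s)
    {
        while (!my_condition_holds(s)) {
            qemu_co_queue_wait(&s->waiters);    /* sleep until restarted */
        }
        /* ... proceed now that the condition holds ... */
    }

    static void coroutine_fn producer(MyState *s)
    {
        make_condition_hold(s);
        qemu_co_queue_restart_all(&s->waiters); /* wake every waiter at once */
    }

Previously the producer side had to spin on qemu_co_queue_next() itself, as
the qcow2 hunk above shows.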
* [Qemu-devel] [PATCH v4 3/8] block: add request tracking
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-12-05 12:17 ` Marcelo Tosatti
2011-12-05 16:09 ` Marcelo Tosatti
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
` (4 subsequent siblings)
7 siblings, 2 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
The block layer does not know about pending requests. This information
is necessary for copy-on-read since overlapping requests must be
serialized to prevent races that corrupt the image.
The BlockDriverState gets a new tracked_request list field which
contains all pending requests. Each request is a BdrvTrackedRequest
record with sector_num, nb_sectors, and is_write fields.
Note that request tracking is always enabled but hopefully this extra
work is so small that it doesn't justify adding an enable/disable flag.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
block_int.h | 4 ++++
2 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/block.c b/block.c
index 0df7eb9..27c4e84 100644
--- a/block.c
+++ b/block.c
@@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
}
}
+struct BdrvTrackedRequest {
+ BlockDriverState *bs;
+ int64_t sector_num;
+ int nb_sectors;
+ bool is_write;
+ QLIST_ENTRY(BdrvTrackedRequest) list;
+};
+
+/**
+ * Remove an active request from the tracked requests list
+ *
+ * This function should be called when a tracked request is completing.
+ */
+static void tracked_request_end(BdrvTrackedRequest *req)
+{
+ QLIST_REMOVE(req, list);
+}
+
+/**
+ * Add an active request to the tracked requests list
+ */
+static void tracked_request_begin(BdrvTrackedRequest *req,
+ BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors, bool is_write)
+{
+ *req = (BdrvTrackedRequest){
+ .bs = bs,
+ .sector_num = sector_num,
+ .nb_sectors = nb_sectors,
+ .is_write = is_write,
+ };
+
+ QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
+}
+
/*
* Return values:
* 0 - success
@@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
BlockDriver *drv = bs->drv;
+ BdrvTrackedRequest req;
+ int ret;
if (!drv) {
return -ENOMEDIUM;
@@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, false, nb_sectors);
}
- return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+ tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+ ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+ tracked_request_end(&req);
+ return ret;
}
int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
@@ -1381,6 +1422,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
BlockDriver *drv = bs->drv;
+ BdrvTrackedRequest req;
int ret;
if (!bs->drv) {
@@ -1398,6 +1440,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, true, nb_sectors);
}
+ tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+
ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
if (bs->dirty_bitmap) {
@@ -1408,6 +1452,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
bs->wr_highest_sector = sector_num + nb_sectors - 1;
}
+ tracked_request_end(&req);
+
return ret;
}
diff --git a/block_int.h b/block_int.h
index f9e2c9a..788fde9 100644
--- a/block_int.h
+++ b/block_int.h
@@ -51,6 +51,8 @@
#define BLOCK_OPT_PREALLOC "preallocation"
#define BLOCK_OPT_SUBFMT "subformat"
+typedef struct BdrvTrackedRequest BdrvTrackedRequest;
+
typedef struct AIOPool {
void (*cancel)(BlockDriverAIOCB *acb);
int aiocb_size;
@@ -250,6 +252,8 @@ struct BlockDriverState {
int in_use; /* users other than guest access, eg. block migration */
QTAILQ_ENTRY(BlockDriverState) list;
void *private;
+
+ QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
};
struct BlockDriverAIOCB {
--
1.7.7.1
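The tracking is deliberately cheap: the BdrvTrackedRequest is a local
variable in the coroutine and is linked into an intrusive list, so no heap
allocation happens per request. For readers who want to play with the
pattern outside QEMU, here is a standalone sketch using the BSD
<sys/queue.h> macros that QEMU's QLIST macros are modeled on:

    #include <sys/queue.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tracked_request {
        int64_t sector_num;
        int nb_sectors;
        bool is_write;
        LIST_ENTRY(tracked_request) list;
    };

    LIST_HEAD(req_list, tracked_request);

    static void do_request(struct req_list *all, int64_t sector_num,
                           int nb_sectors, bool is_write)
    {
        struct tracked_request req = {
            .sector_num = sector_num,
            .nb_sectors = nb_sectors,
            .is_write = is_write,
        };

        LIST_INSERT_HEAD(all, &req, list);  /* like tracked_request_begin() */
        /* ... the actual I/O would happen here ... */
        LIST_REMOVE(&req, list);            /* like tracked_request_end() */
    }

    int main(void)
    {
        struct req_list all;

        LIST_INIT(&all);
        do_request(&all, 10, 2, false);
        printf("list empty again: %d\n", LIST_FIRST(&all) == NULL);
        return 0;
    }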
* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
@ 2011-12-05 12:17 ` Marcelo Tosatti
2011-12-05 12:20 ` Paolo Bonzini
2011-12-05 16:09 ` Marcelo Tosatti
1 sibling, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 12:17 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel
On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> The block layer does not know about pending requests. This information
> is necessary for copy-on-read since overlapping requests must be
> serialized to prevent races that corrupt the image.
>
> The BlockDriverState gets a new tracked_request list field which
> contains all pending requests. Each request is a BdrvTrackedRequest
> record with sector_num, nb_sectors, and is_write fields.
>
> Note that request tracking is always enabled but hopefully this extra
> work is so small that it doesn't justify adding an enable/disable flag.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> block.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
> block_int.h | 4 ++++
> 2 files changed, 51 insertions(+), 1 deletions(-)
>
> diff --git a/block.c b/block.c
> index 0df7eb9..27c4e84 100644
> --- a/block.c
> +++ b/block.c
> @@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
> }
> }
>
> +struct BdrvTrackedRequest {
> + BlockDriverState *bs;
> + int64_t sector_num;
> + int nb_sectors;
> + bool is_write;
> + QLIST_ENTRY(BdrvTrackedRequest) list;
> +};
> +
> +/**
> + * Remove an active request from the tracked requests list
> + *
> + * This function should be called when a tracked request is completing.
> + */
> +static void tracked_request_end(BdrvTrackedRequest *req)
> +{
> + QLIST_REMOVE(req, list);
> +}
> +
> +/**
> + * Add an active request to the tracked requests list
> + */
> +static void tracked_request_begin(BdrvTrackedRequest *req,
> + BlockDriverState *bs,
> + int64_t sector_num,
> + int nb_sectors, bool is_write)
> +{
> + *req = (BdrvTrackedRequest){
> + .bs = bs,
> + .sector_num = sector_num,
> + .nb_sectors = nb_sectors,
> + .is_write = is_write,
> + };
> +
> + QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
> +}
> +
> /*
> * Return values:
> * 0 - success
> @@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> {
> BlockDriver *drv = bs->drv;
> + BdrvTrackedRequest req;
> + int ret;
>
> if (!drv) {
> return -ENOMEDIUM;
> @@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> bdrv_io_limits_intercept(bs, false, nb_sectors);
> }
>
> - return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> + tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> + ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> + tracked_request_end(&req);
> + return ret;
> }
>
> int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
> @@ -1381,6 +1422,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> {
> BlockDriver *drv = bs->drv;
> + BdrvTrackedRequest req;
> int ret;
>
> if (!bs->drv) {
> @@ -1398,6 +1440,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> bdrv_io_limits_intercept(bs, true, nb_sectors);
> }
>
> + tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
> +
> ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
>
> if (bs->dirty_bitmap) {
> @@ -1408,6 +1452,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> bs->wr_highest_sector = sector_num + nb_sectors - 1;
> }
>
> + tracked_request_end(&req);
> +
> return ret;
> }
There is no need to worry about synchronous read/write requests
bypassing this interface, correct?
* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
2011-12-05 12:17 ` Marcelo Tosatti
@ 2011-12-05 16:09 ` Marcelo Tosatti
2011-12-05 16:20 ` Stefan Hajnoczi
1 sibling, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 16:09 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel
On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> The block layer does not know about pending requests. This information
> is necessary for copy-on-read since overlapping requests must be
> serialized to prevent races that corrupt the image.
>
> The BlockDriverState gets a new tracked_request list field which
> contains all pending requests. Each request is a BdrvTrackedRequest
> record with sector_num, nb_sectors, and is_write fields.
>
> Note that request tracking is always enabled but hopefully this extra
> work is so small that it doesn't justify adding an enable/disable flag.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> block.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
> block_int.h | 4 ++++
> 2 files changed, 51 insertions(+), 1 deletions(-)
>
> diff --git a/block.c b/block.c
> index 0df7eb9..27c4e84 100644
> --- a/block.c
> +++ b/block.c
> @@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
> }
> }
>
> +struct BdrvTrackedRequest {
> + BlockDriverState *bs;
> + int64_t sector_num;
> + int nb_sectors;
> + bool is_write;
> + QLIST_ENTRY(BdrvTrackedRequest) list;
> +};
> +
> +/**
> + * Remove an active request from the tracked requests list
> + *
> + * This function should be called when a tracked request is completing.
> + */
> +static void tracked_request_end(BdrvTrackedRequest *req)
> +{
> + QLIST_REMOVE(req, list);
> +}
> +
> +/**
> + * Add an active request to the tracked requests list
> + */
> +static void tracked_request_begin(BdrvTrackedRequest *req,
> + BlockDriverState *bs,
> + int64_t sector_num,
> + int nb_sectors, bool is_write)
> +{
> + *req = (BdrvTrackedRequest){
> + .bs = bs,
> + .sector_num = sector_num,
> + .nb_sectors = nb_sectors,
> + .is_write = is_write,
> + };
> +
> + QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
> +}
> +
> /*
> * Return values:
> * 0 - success
> @@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> {
> BlockDriver *drv = bs->drv;
> + BdrvTrackedRequest req;
> + int ret;
>
> if (!drv) {
> return -ENOMEDIUM;
> @@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> bdrv_io_limits_intercept(bs, false, nb_sectors);
> }
>
> - return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> + tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> + ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> + tracked_request_end(&req);
> + return ret;
> }
Stefan,
In an earlier discussion, you replied to me:
"
>> req = tracked_request_add(bs, sector_num, nb_sectors, false);
>
> The tracked request should include cluster round info?
When checking A and B for overlap, only one of them needs to be
rounded in order for overlap detection to be correct. Therefore we
can store the original request [start, length) in tracked_requests and
only round the new request.
"
The problem AFAICS is this:
- Store a non-cluster-aligned request in the tracked request list.
- Wait on that non-cluster-aligned request
(wait_for_overlapping_requests).
- Submit cluster-aligned request for COR request.
So the tracked request list does not properly reflect the in-flight
COR requests, which can result in:
1) guest reads sector 10.
2) <sector_num=10,nb_sectors=2> added to tracked request list.
3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>
4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
allowed to proceed without waiting because the tracked request list does
not reflect the in-flight COR requests.
Am I missing something here?
* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
2011-12-05 16:09 ` Marcelo Tosatti
@ 2011-12-05 16:20 ` Stefan Hajnoczi
2011-12-05 16:31 ` Marcelo Tosatti
0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 16:20 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel
On Mon, Dec 5, 2011 at 4:09 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> On earlier discussion, you replied to me:
>
> "
>>> req = tracked_request_add(bs, sector_num, nb_sectors, false);
>>
>> The tracked request should include cluster round info?
>
> When checking A and B for overlap, only one of them needs to be
> rounded in order for overlap detection to be correct. Therefore we
> can store the original request [start, length) in tracked_requests and
> only round the new request.
> "
>
> The problem AFAICS is this:
>
> - Store a non-cluster-aligned request in the tracked request list.
> - Wait on that non-cluster-aligned request
> (wait_for_overlapping_requests).
> - Submit cluster-aligned request for COR request.
>
> So, the tracked request list does not properly reflect the in-flight
> COR requests. Which can result in:
>
> 1) guest reads sector 10.
> 2) <sector_num=10,nb_sectors=2> added to tracked request list.
> 3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>
It will also round down to sector_num=0 when cluster size is 128
sectors (e.g. qcow2 and qed).
> 4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
> allowed to proceed without waiting because tracked request list does not
> reflect what COR in-flight requests.
The tracked request list has <sector_num=10, nb_sectors=2> and the
candidate write request is <sector_num=13, nb_sectors=1>.
In wait_for_overlapping_requests() we round the candidate request to
<sector_num=0, nb_sectors=cluster_size>. This rounded request does
overlap <sector_num=10, sectors=2> so it will need to wait.
Therefore CoR and write will not execute at the same time.
Does this make more sense?
Stefan
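To make this concrete outside QEMU, here is a standalone arithmetic check
of the scenario. The rounding and the overlap test mirror the
round_to_clusters() and tracked_request_overlaps() helpers in patch 6/8; a
128-sector (64 KB) cluster is assumed:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ALIGN_DOWN(n, m) ((n) / (m) * (m))
    #define ALIGN_UP(n, m)   ALIGN_DOWN((n) + (m) - 1, (m))

    static bool overlaps(int64_t a_start, int a_len,
                         int64_t b_start, int b_len)
    {
        return !(b_start >= a_start + a_len || a_start >= b_start + b_len);
    }

    int main(void)
    {
        const int cluster = 128;    /* sectors per cluster (64 KB / 512) */
        int64_t tracked_start = 10; /* guest read already in flight */
        int tracked_len = 2;
        int64_t cand_start = 13;    /* new, unrelated guest write */
        int cand_len = 1;

        /* Only the candidate request is rounded to cluster boundaries. */
        int64_t r_start = ALIGN_DOWN(cand_start, cluster);
        int r_len = ALIGN_UP(cand_start - r_start + cand_len, cluster);

        printf("rounded candidate: <%lld, %d>\n", (long long)r_start, r_len);
        printf("must wait: %d\n",
               overlaps(tracked_start, tracked_len, r_start, r_len));
        return 0;
    }

The candidate <13, 1> rounds to <0, 128>, which overlaps the tracked
<10, 2>, so the write waits as described above.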
* Re: [Qemu-devel] [PATCH v4 3/8] block: add request tracking
2011-12-05 16:20 ` Stefan Hajnoczi
@ 2011-12-05 16:31 ` Marcelo Tosatti
0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-05 16:31 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel
On Mon, Dec 05, 2011 at 04:20:55PM +0000, Stefan Hajnoczi wrote:
> On Mon, Dec 5, 2011 at 4:09 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Wed, Nov 23, 2011 at 11:47:53AM +0000, Stefan Hajnoczi wrote:
> > On earlier discussion, you replied to me:
> >
> > "
> >>> req = tracked_request_add(bs, sector_num, nb_sectors, false);
> >>
> >> The tracked request should include cluster round info?
> >
> > When checking A and B for overlap, only one of them needs to be
> > rounded in order for overlap detection to be correct. Therefore we
> > can store the original request [start, length) in tracked_requests and
> > only round the new request.
> > "
> >
> > The problem AFAICS is this:
> >
> > - Store a non-cluster-aligned request in the tracked request list.
> > - Wait on that non-cluster-aligned request
> > (wait_for_overlapping_requests).
> > - Submit cluster-aligned request for COR request.
> >
> > So, the tracked request list does not properly reflect the in-flight
> > COR requests. Which can result in:
> >
> > 1) guest reads sector 10.
> > 2) <sector_num=10,nb_sectors=2> added to tracked request list.
> > 3) COR code submits read for <sector_num=10,nb_sectors=2+cluster_align>
>
> It will also round down to sector_num=0 when cluster size is 128
> sectors (e.g. qcow2 and qed).
>
> > 4) unrelated guest operation writes to sector 13, nb_sectors=1. That is
> > allowed to proceed without waiting because tracked request list does not
> > reflect what COR in-flight requests.
>
> The tracked request list has <sector_num=10, nb_sectors=2> and the
> candidate write request is <sector_num=13, nb_sectors=1>.
>
> In wait_for_overlapping_requests() we round the candidate request to
> <sector_num=0, nb_sectors=cluster_size>. This rounded request does
> overlap <sector_num=10, sectors=2> so it will need to wait.
>
> Therefore CoR and write will not execute at the same time.
>
> Does this make more sense?
>
> Stefan
Ah yes, same mistake on my part again. Thanks.
* [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read()
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
` (2 preceding siblings ...)
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 3/8] block: add request tracking Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests Stefan Hajnoczi
` (3 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
The bdrv_set_copy_on_read() function can be used to programmatically
enable or disable copy-on-read for a block device. Later patches add
the actual copy-on-read logic.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 22 ++++++++++++++++++++++
block.h | 3 +++
block_int.h | 2 ++
3 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index 27c4e84..c90880b 100644
--- a/block.c
+++ b/block.c
@@ -538,6 +538,22 @@ int bdrv_parse_cache_flags(const char *mode, int *flags)
return 0;
}
+/**
+ * Enable/disable copy-on-read
+ *
+ * This is based on a reference count so multiple users may call this function
+ * without worrying about clobbering the previous state. Copy-on-read stays
+ * enabled until all users have called to disable it.
+ */
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable)
+{
+ if (enable) {
+ bs->copy_on_read++;
+ } else {
+ bs->copy_on_read--;
+ }
+}
+
/*
* Common part for opening disk images and files
*/
@@ -559,6 +575,11 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
bs->growable = 0;
bs->buffer_alignment = 512;
+ assert(bs->copy_on_read == 0); /* bdrv_new() and bdrv_close() make it so */
+ if (flags & BDRV_O_RDWR) {
+ bdrv_set_copy_on_read(bs, !!(flags & BDRV_O_COPY_ON_READ));
+ }
+
pstrcpy(bs->filename, sizeof(bs->filename), filename);
bs->backing_file[0] = '\0';
@@ -801,6 +822,7 @@ void bdrv_close(BlockDriverState *bs)
#endif
bs->opaque = NULL;
bs->drv = NULL;
+ bs->copy_on_read = 0;
if (bs->file != NULL) {
bdrv_close(bs->file);
diff --git a/block.h b/block.h
index ad8dd48..68b4b14 100644
--- a/block.h
+++ b/block.h
@@ -70,6 +70,7 @@ typedef struct BlockDevOps {
#define BDRV_O_NATIVE_AIO 0x0080 /* use native AIO instead of the thread pool */
#define BDRV_O_NO_BACKING 0x0100 /* don't open the backing file */
#define BDRV_O_NO_FLUSH 0x0200 /* disable flushing on this disk */
+#define BDRV_O_COPY_ON_READ 0x0400 /* copy read backing sectors into image */
#define BDRV_O_CACHE_MASK (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
@@ -308,6 +309,8 @@ void bdrv_reset_dirty(BlockDriverState *bs, int64_t cur_sector,
int nr_sectors);
int64_t bdrv_get_dirty_count(BlockDriverState *bs);
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable);
+
void bdrv_set_in_use(BlockDriverState *bs, int in_use);
int bdrv_in_use(BlockDriverState *bs);
diff --git a/block_int.h b/block_int.h
index 788fde9..3c5bacb 100644
--- a/block_int.h
+++ b/block_int.h
@@ -193,6 +193,8 @@ struct BlockDriverState {
int encrypted; /* if true, the media is encrypted */
int valid_key; /* if true, a valid encryption key has been set */
int sg; /* if true, the device is a /dev/sg* */
+ int copy_on_read; /* if true, copy read backing sectors into image
+ note this is a reference count */
BlockDriver *drv; /* NULL means no media */
void *opaque;
--
1.7.7.1
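A minimal standalone sketch of the reference-count behaviour described in
the comment above (hypothetical code, not part of the patch): two
independent users can enable copy-on-read, and it stays enabled until both
have disabled it again.

    #include <stdbool.h>
    #include <stdio.h>

    static int copy_on_read;            /* stands in for bs->copy_on_read */

    static void set_copy_on_read(bool enable)
    {
        if (enable) {
            copy_on_read++;
        } else {
            copy_on_read--;
        }
    }

    int main(void)
    {
        set_copy_on_read(true);         /* e.g. -drive copy-on-read=on */
        set_copy_on_read(true);         /* e.g. a second, internal user */
        set_copy_on_read(false);        /* the internal user is done */
        printf("enabled: %d\n", copy_on_read > 0);  /* 1 */
        set_copy_on_read(false);
        printf("enabled: %d\n", copy_on_read > 0);  /* 0 */
        return 0;
    }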
* [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
` (3 preceding siblings ...)
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 6/8] block: request overlap detection Stefan Hajnoczi
` (2 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
When copy-on-read is enabled, it is necessary to wait for overlapping
requests before issuing new requests. This prevents races between a
copy-on-read operation and a write request.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 35 +++++++++++++++++++++++++++++++++++
1 files changed, 35 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index c90880b..da7aaa2 100644
--- a/block.c
+++ b/block.c
@@ -1099,6 +1099,7 @@ struct BdrvTrackedRequest {
int nb_sectors;
bool is_write;
QLIST_ENTRY(BdrvTrackedRequest) list;
+ CoQueue wait_queue; /* coroutines blocked on this request */
};
/**
@@ -1109,6 +1110,7 @@ struct BdrvTrackedRequest {
static void tracked_request_end(BdrvTrackedRequest *req)
{
QLIST_REMOVE(req, list);
+ qemu_co_queue_restart_all(&req->wait_queue);
}
/**
@@ -1126,9 +1128,34 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
.is_write = is_write,
};
+ qemu_co_queue_init(&req->wait_queue);
+
QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
}
+static bool tracked_request_overlaps(BdrvTrackedRequest *req,
+ int64_t sector_num, int nb_sectors) {
+ return false; /* not yet implemented */
+}
+
+static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors)
+{
+ BdrvTrackedRequest *req;
+ bool retry;
+
+ do {
+ retry = false;
+ QLIST_FOREACH(req, &bs->tracked_requests, list) {
+ if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+ qemu_co_queue_wait(&req->wait_queue);
+ retry = true;
+ break;
+ }
+ }
+ } while (retry);
+}
+
/*
* Return values:
* 0 - success
@@ -1423,6 +1450,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, false, nb_sectors);
}
+ if (bs->copy_on_read) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ }
+
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
tracked_request_end(&req);
@@ -1462,6 +1493,10 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, true, nb_sectors);
}
+ if (bs->copy_on_read) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ }
+
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
--
1.7.7.1
* [Qemu-devel] [PATCH v4 6/8] block: request overlap detection
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
` (4 preceding siblings ...)
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 5/8] block: wait for overlapping requests Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
Detect overlapping requests and remember to align to cluster boundaries
if the image format uses them. This assumes that allocating I/O is
performed at cluster granularity, which is true for qcow2, qed, etc.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index da7aaa2..c30c8f2 100644
--- a/block.c
+++ b/block.c
@@ -1133,21 +1133,62 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
}
+/**
+ * Round a region to cluster boundaries
+ */
+static void round_to_clusters(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors,
+ int64_t *cluster_sector_num,
+ int *cluster_nb_sectors)
+{
+ BlockDriverInfo bdi;
+
+ if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+ *cluster_sector_num = sector_num;
+ *cluster_nb_sectors = nb_sectors;
+ } else {
+ int64_t c = bdi.cluster_size / BDRV_SECTOR_SIZE;
+ *cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
+ *cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
+ nb_sectors, c);
+ }
+}
+
static bool tracked_request_overlaps(BdrvTrackedRequest *req,
int64_t sector_num, int nb_sectors) {
- return false; /* not yet implemented */
+ /* aaaa bbbb */
+ if (sector_num >= req->sector_num + req->nb_sectors) {
+ return false;
+ }
+ /* bbbb aaaa */
+ if (req->sector_num >= sector_num + nb_sectors) {
+ return false;
+ }
+ return true;
}
static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
int64_t sector_num, int nb_sectors)
{
BdrvTrackedRequest *req;
+ int64_t cluster_sector_num;
+ int cluster_nb_sectors;
bool retry;
+ /* If we touch the same cluster it counts as an overlap. This guarantees
+ * that allocating writes will be serialized and not race with each other
+ * for the same cluster. For example, in copy-on-read it ensures that the
+ * CoR read and write operations are atomic and guest writes cannot
+ * interleave between them.
+ */
+ round_to_clusters(bs, sector_num, nb_sectors,
+ &cluster_sector_num, &cluster_nb_sectors);
+
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
- if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+ if (tracked_request_overlaps(req, cluster_sector_num,
+ cluster_nb_sectors)) {
qemu_co_queue_wait(&req->wait_queue);
retry = true;
break;
--
1.7.7.1
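One detail worth spelling out: tracked_request_overlaps() works on
half-open [start, start + len) ranges, so merely adjacent regions do not
count as overlapping. A standalone check (not QEMU code, same comparison
logic):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool ranges_overlap(int64_t a_start, int a_len,
                               int64_t b_start, int b_len)
    {
        if (b_start >= a_start + a_len) {
            return false;   /* aaaa bbbb */
        }
        if (a_start >= b_start + b_len) {
            return false;   /* bbbb aaaa */
        }
        return true;
    }

    int main(void)
    {
        printf("%d\n", ranges_overlap(0, 128, 128, 128)); /* 0: adjacent   */
        printf("%d\n", ranges_overlap(0, 128, 127, 2));   /* 1: shares 127 */
        printf("%d\n", ranges_overlap(10, 2, 0, 128));    /* 1: contained  */
        return 0;
    }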
* [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
` (5 preceding siblings ...)
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 6/8] block: request overlap detection Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
2011-12-02 17:13 ` Marcelo Tosatti
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
7 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
trace-events | 1 +
2 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index c30c8f2..a598a19 100644
--- a/block.c
+++ b/block.c
@@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
return 0;
}
+static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+{
+ /* Perform I/O through a temporary buffer so that users who scribble over
+ * their read buffer while the operation is in progress do not end up
+ * modifying the image file. This is critical for zero-copy guest I/O
+ * where anything might happen inside guest memory.
+ */
+ void *bounce_buffer;
+
+ struct iovec iov;
+ QEMUIOVector bounce_qiov;
+ int64_t cluster_sector_num;
+ int cluster_nb_sectors;
+ size_t skip_bytes;
+ int ret;
+
+ /* Cover entire cluster so no additional backing file I/O is required when
+ * allocating cluster in the image file.
+ */
+ round_to_clusters(bs, sector_num, nb_sectors,
+ &cluster_sector_num, &cluster_nb_sectors);
+
+ trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
+ cluster_sector_num, cluster_nb_sectors);
+
+ iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
+ iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+ qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+ ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+ &bounce_qiov);
+ if (ret < 0) {
+ goto err;
+ }
+
+ ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+ &bounce_qiov);
+ if (ret < 0) {
+ /* It might be okay to ignore write errors for guest requests. If this
+ * is a deliberate copy-on-read then we don't want to ignore the error.
+ * Simply report it in all cases.
+ */
+ goto err;
+ }
+
+ skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
+ qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
+ nb_sectors * BDRV_SECTOR_SIZE);
+
+err:
+ qemu_vfree(bounce_buffer);
+ return ret;
+}
+
/*
* Handle a read request in coroutine context
*/
@@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
}
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+
+ if (bs->copy_on_read) {
+ int pnum;
+
+ ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
+ if (ret < 0) {
+ goto out;
+ }
+
+ if (!ret || pnum != nb_sectors) {
+ ret = bdrv_co_copy_on_readv(bs, sector_num, nb_sectors, qiov);
+ goto out;
+ }
+ }
+
ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+
+out:
tracked_request_end(&req);
return ret;
}
diff --git a/trace-events b/trace-events
index 962caca..518b76b 100644
--- a/trace-events
+++ b/trace-events
@@ -69,6 +69,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
+bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
# hw/virtio-blk.c
virtio_blk_req_complete(void *req, int status) "req %p status %d"
--
1.7.7.1
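To make the bounce-buffer offset arithmetic concrete, here is a standalone
example with made-up numbers (512-byte sectors, 64 KB clusters): the CoR
read covers the whole cluster, but only the guest's sub-range is copied
back out.

    #include <stddef.h>
    #include <stdio.h>

    #define BDRV_SECTOR_SIZE 512

    int main(void)
    {
        long cluster_sector_num = 128;  /* rounded-down start of CoR read */
        long sector_num = 200;          /* start of the guest's request */
        int nb_sectors = 10;

        size_t skip_bytes = (sector_num - cluster_sector_num) *
                            BDRV_SECTOR_SIZE;

        printf("copy %d bytes from offset %zu of the bounce buffer\n",
               nb_sectors * BDRV_SECTOR_SIZE, skip_bytes); /* 5120 at 36864 */
        return 0;
    }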
* Re: [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-12-02 17:13 ` Marcelo Tosatti
2011-12-05 11:35 ` Stefan Hajnoczi
0 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2011-12-02 17:13 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel
On Wed, Nov 23, 2011 at 11:47:57AM +0000, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> block.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> trace-events | 1 +
> 2 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/block.c b/block.c
> index c30c8f2..a598a19 100644
> --- a/block.c
> +++ b/block.c
> @@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
> return 0;
> }
>
> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> + int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> +{
> + /* Perform I/O through a temporary buffer so that users who scribble over
> + * their read buffer while the operation is in progress do not end up
> + * modifying the image file. This is critical for zero-copy guest I/O
> + * where anything might happen inside guest memory.
> + */
> + void *bounce_buffer;
> +
> + struct iovec iov;
> + QEMUIOVector bounce_qiov;
> + int64_t cluster_sector_num;
> + int cluster_nb_sectors;
> + size_t skip_bytes;
> + int ret;
> +
> + /* Cover entire cluster so no additional backing file I/O is required when
> + * allocating cluster in the image file.
> + */
> + round_to_clusters(bs, sector_num, nb_sectors,
> + &cluster_sector_num, &cluster_nb_sectors);
> +
> + trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
> + cluster_sector_num, cluster_nb_sectors);
> +
> + iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
> + iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
> + qemu_iovec_init_external(&bounce_qiov, &iov, 1);
> +
> + ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
> + &bounce_qiov);
> + if (ret < 0) {
> + goto err;
> + }
> +
> + ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
> + &bounce_qiov);
> + if (ret < 0) {
> + /* It might be okay to ignore write errors for guest requests. If this
> + * is a deliberate copy-on-read then we don't want to ignore the error.
> + * Simply report it in all cases.
> + */
> + goto err;
> + }
> +
> + skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
> + qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
> + nb_sectors * BDRV_SECTOR_SIZE);
> +
> +err:
> + qemu_vfree(bounce_buffer);
> + return ret;
> +}
> +
> /*
> * Handle a read request in coroutine context
> */
> @@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> }
>
> tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +
> + if (bs->copy_on_read) {
> + int pnum;
> +
> + ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
> + if (ret < 0) {
> + goto out;
> + }
Stefan,
It is not clear where support for shared backing files would fit.
Let us consider the following block copy example:
1) Original chain:
[ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]
2) Final chain:
[ BASE ] -> [ IMAGE-3 ]
I was talking to Kevin and we don't have code or a monitor command to do
the switch from 1) to 2). But that is a separate issue.
The question is, how do you plan to stream the contents of IMAGE-1 and
IMAGE-2 (but not BASE) into IMAGE-3? That is an important use case.
Also, do you have any status on the external COW file work for raw images?
Thanks
* Re: [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic
2011-12-02 17:13 ` Marcelo Tosatti
@ 2011-12-05 11:35 ` Stefan Hajnoczi
0 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 11:35 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel
On Fri, Dec 2, 2011 at 5:13 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Nov 23, 2011 at 11:47:57AM +0000, Stefan Hajnoczi wrote:
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> ---
>> block.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> trace-events | 1 +
>> 2 files changed, 73 insertions(+), 0 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index c30c8f2..a598a19 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1469,6 +1469,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
>> return 0;
>> }
>>
>> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>> + int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
>> +{
>> + /* Perform I/O through a temporary buffer so that users who scribble over
>> + * their read buffer while the operation is in progress do not end up
>> + * modifying the image file. This is critical for zero-copy guest I/O
>> + * where anything might happen inside guest memory.
>> + */
>> + void *bounce_buffer;
>> +
>> + struct iovec iov;
>> + QEMUIOVector bounce_qiov;
>> + int64_t cluster_sector_num;
>> + int cluster_nb_sectors;
>> + size_t skip_bytes;
>> + int ret;
>> +
>> + /* Cover entire cluster so no additional backing file I/O is required when
>> + * allocating cluster in the image file.
>> + */
>> + round_to_clusters(bs, sector_num, nb_sectors,
>> + &cluster_sector_num, &cluster_nb_sectors);
>> +
>> + trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
>> + cluster_sector_num, cluster_nb_sectors);
>> +
>> + iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
>> + iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
>> + qemu_iovec_init_external(&bounce_qiov, &iov, 1);
>> +
>> + ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
>> + &bounce_qiov);
>> + if (ret < 0) {
>> + goto err;
>> + }
>> +
>> + ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
>> + &bounce_qiov);
>> + if (ret < 0) {
>> + /* It might be okay to ignore write errors for guest requests. If this
>> + * is a deliberate copy-on-read then we don't want to ignore the error.
>> + * Simply report it in all cases.
>> + */
>> + goto err;
>> + }
>> +
>> + skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
>> + qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
>> + nb_sectors * BDRV_SECTOR_SIZE);
>> +
>> +err:
>> + qemu_vfree(bounce_buffer);
>> + return ret;
>> +}
>> +
>> /*
>> * Handle a read request in coroutine context
>> */
>> @@ -1496,7 +1551,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>> }
>>
>> tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>> +
>> + if (bs->copy_on_read) {
>> + int pnum;
>> +
>> + ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
>> + if (ret < 0) {
>> + goto out;
>> + }
>
> Stefan,
>
> It is not clear where support for shared backing files would fit.
> Let us consider the following block copy example:
>
> 1) Original chain:
> [ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]
>
> 2) Final chain:
> [ BASE ] -> [ IMAGE-3 ]
>
> I was talking to Kevin and we don't have code/monitor command to the
> switch from 1) to 2). But that is a separate issue.
>
> Question is, how do you plan to stream the contents of IMAGE-1 and
> IMAGE-2 (but not BASE), into IMAGE-3 ? That is an important use case.
We need to introduce a "deep" bdrv_is_allocated() which takes a "base"
argument at which to terminate the search. Here's what it will do:

[ BASE ] -> [ IMAGE-1 ] -> [ IMAGE-2 ] -> [ IMAGE-3 ]

def bdrv_co_is_allocated(IMAGE-3, sector_num=0, base=BASE):
    if IMAGE-3 sector_num=0 is allocated:
        return True   # already allocated
    if IMAGE-2 sector_num=0 is allocated:
        return False  # copy it!
    if IMAGE-1 sector_num=0 is allocated:
        return False  # copy it!
    return True       # reached base image so we don't care

This is the hardcoded version of the function; in reality it will be a
loop, but it shows the idea.
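A hedged C sketch of what that loop could look like (the helper name is
hypothetical and the pnum sub-range handling is omitted for brevity;
bdrv_co_is_allocated() is the call used in patch 7/8 and backing_hd is the
BlockDriverState backing-file pointer):

    static int coroutine_fn is_allocated_above(BlockDriverState *top,
                                               BlockDriverState *base,
                                               int64_t sector_num,
                                               int nb_sectors)
    {
        BlockDriverState *bs;
        int pnum;
        int ret;

        for (bs = top; bs != base; bs = bs->backing_hd) {
            ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
            if (ret < 0) {
                return ret;
            }
            if (ret) {
                /* Allocated in 'top' -> nothing to copy; allocated in an
                 * intermediate image -> it must be streamed into 'top'. */
                return bs == top;
            }
        }
        return 1;   /* reached 'base' without finding data: we don't care */
    }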
> Also, do you have status on the external COW file work, for raw images?
Sure, please help review Dong Xu Wang's add-cow patches on the list; he is
looking for feedback.
Stefan
* [Qemu-devel] [PATCH v4 8/8] block: add -drive copy-on-read=on|off
2011-11-23 11:47 [Qemu-devel] [PATCH v4 0/8] block: generic copy-on-read Stefan Hajnoczi
` (6 preceding siblings ...)
2011-11-23 11:47 ` [Qemu-devel] [PATCH v4 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-11-23 11:47 ` Stefan Hajnoczi
7 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23 11:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi
This patch adds the -drive copy-on-read=on|off command-line option:
copy-on-read=on|off
copy-on-read is "on" or "off" and controls whether to copy read backing
file sectors into the image file. Copy-on-read avoids accessing the
same backing file sectors repeatedly and is useful when the backing
file is over a slow network. By default copy-on-read is off.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
blockdev.c | 6 ++++++
hmp-commands.hx | 5 +++--
qemu-config.c | 4 ++++
qemu-options.hx | 9 ++++++++-
4 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/blockdev.c b/blockdev.c
index 9068c5b..af4e239 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -257,6 +257,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
DriveInfo *dinfo;
BlockIOLimit io_limits;
int snapshot = 0;
+ bool copy_on_read;
int ret;
translation = BIOS_ATA_TRANSLATION_AUTO;
@@ -273,6 +274,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
snapshot = qemu_opt_get_bool(opts, "snapshot", 0);
ro = qemu_opt_get_bool(opts, "readonly", 0);
+ copy_on_read = qemu_opt_get_bool(opts, "copy-on-read", false);
file = qemu_opt_get(opts, "file");
serial = qemu_opt_get(opts, "serial");
@@ -546,6 +548,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
bdrv_flags |= (BDRV_O_SNAPSHOT|BDRV_O_CACHE_WB|BDRV_O_NO_FLUSH);
}
+ if (copy_on_read) {
+ bdrv_flags |= BDRV_O_COPY_ON_READ;
+ }
+
if (media == MEDIA_CDROM) {
/* CDROM is fine for any interface, don't check. */
ro = 1;
diff --git a/hmp-commands.hx b/hmp-commands.hx
index f8d855e..79a9195 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -860,9 +860,10 @@ ETEXI
.args_type = "pci_addr:s,opts:s",
.params = "[[<domain>:]<bus>:]<slot>\n"
"[file=file][,if=type][,bus=n]\n"
- "[,unit=m][,media=d][index=i]\n"
+ "[,unit=m][,media=d][,index=i]\n"
"[,cyls=c,heads=h,secs=s[,trans=t]]\n"
- "[snapshot=on|off][,cache=on|off]",
+ "[,snapshot=on|off][,cache=on|off]\n"
+ "[,readonly=on|off][,copy-on-read=on|off]",
.help = "add drive to PCI storage controller",
.mhandler.cmd = drive_hot_add,
},
diff --git a/qemu-config.c b/qemu-config.c
index 1aa080f..18f3020 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -109,6 +109,10 @@ static QemuOptsList qemu_drive_opts = {
.name = "bps_wr",
.type = QEMU_OPT_NUMBER,
.help = "limit write bytes per second",
+ },{
+ .name = "copy-on-read",
+ .type = QEMU_OPT_BOOL,
+ .help = "copy read data from backing file into image file",
},
{ /* end of list */ }
},
diff --git a/qemu-options.hx b/qemu-options.hx
index 25a7be7..b3db10c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -135,7 +135,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
" [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
" [,cache=writethrough|writeback|none|directsync|unsafe][,format=f]\n"
" [,serial=s][,addr=A][,id=name][,aio=threads|native]\n"
- " [,readonly=on|off]\n"
+ " [,readonly=on|off][,copy-on-read=on|off]\n"
" [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]][[,iops=i]|[[,iops_rd=r][,iops_wr=w]]\n"
" use 'file' as a drive image\n", QEMU_ARCH_ALL)
STEXI
@@ -187,6 +187,9 @@ host disk is full; report the error to the guest otherwise).
The default setting is @option{werror=enospc} and @option{rerror=report}.
@item readonly
Open drive @option{file} as read-only. Guest write attempts will fail.
+@item copy-on-read=@var{copy-on-read}
+@var{copy-on-read} is "on" or "off" and enables whether to copy read backing
+file sectors into the image file.
@end table
By default, writethrough caching is used for all block device. This means that
@@ -218,6 +221,10 @@ like your host losing power, the disk storage getting disconnected accidently,
etc. you're image will most probably be rendered unusable. When using
the @option{-snapshot} option, unsafe caching is always used.
+Copy-on-read avoids accessing the same backing file sectors repeatedly and is
+useful when the backing file is over a slow network. By default copy-on-read
+is off.
+
Instead of @option{-cdrom} you can use:
@example
qemu -drive file=file,index=2,media=cdrom
--
1.7.7.1
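For example, a hypothetical invocation (file name made up) that enables the
option for a qcow2 image whose backing file lives on slow or remote
storage:

    qemu -drive file=local.qcow2,if=virtio,copy-on-read=on

Reads that hit unallocated areas of local.qcow2 are then served from the
backing file once and written into local.qcow2, so repeated reads of the
same sectors stay local.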