* [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster
@ 2020-08-27 14:53 Alberto Garcia
2020-09-09 12:52 ` Alberto Garcia
2020-09-09 19:23 ` Vladimir Sementsov-Ogievskiy
0 siblings, 2 replies; 4+ messages in thread
From: Alberto Garcia @ 2020-08-27 14:53 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
qemu-block, Max Reitz
Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.
In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.
This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.
After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
---
v2:
- Add new, simpler API: bdrv_is_unallocated_or_zero_above()
include/block/block.h | 2 ++
block/io.c | 24 ++++++++++++++++++++++++
block/qcow2.c | 37 +++++++++++++++++++++----------------
3 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/include/block/block.h b/include/block/block.h
index 6e36154061..477a72b2e9 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -496,6 +496,8 @@ int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
bool include_base, int64_t offset, int64_t bytes,
int64_t *pnum);
+int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
+ int64_t bytes);
bool bdrv_is_read_only(BlockDriverState *bs);
int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
diff --git a/block/io.c b/block/io.c
index ad3a51ed53..94faa3f9d7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2557,6 +2557,30 @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
offset, bytes, pnum, map, file);
}
+/*
+ * Check @bs (and its backing chain) to see if the range defined
+ * by @offset and @bytes is unallocated or known to read as zeroes.
+ * Return 1 if that is the case, 0 otherwise and -errno on error.
+ * This test is meant to be fast rather than accurate so returning 0
+ * does not guarantee non-zero data.
+ */
+int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
+ int64_t bytes)
+{
+ int ret;
+ int64_t pnum = bytes;
+
+ ret = bdrv_common_block_status_above(bs, NULL, false, offset,
+ bytes, &pnum, NULL, NULL);
+
+ if (ret < 0) {
+ return ret;
+ }
+
+ return (pnum == bytes) &&
+ ((ret & BDRV_BLOCK_ZERO) || (!(ret & BDRV_BLOCK_ALLOCATED)));
+}
+
int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
int64_t bytes, int64_t *pnum)
{
diff --git a/block/qcow2.c b/block/qcow2.c
index da56b1a4df..29bea548db 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2391,26 +2391,28 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
return false;
}
-static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
-{
- int64_t nr;
- return !bytes ||
- (!bdrv_is_allocated_above(bs, NULL, false, offset, bytes, &nr) &&
- nr == bytes);
-}
-
-static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+/*
+ * Return 1 if the COW regions read as zeroes, 0 if not, < 0 on error.
+ * Note that returning 0 does not guarantee non-zero data.
+ */
+static int is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
{
/*
* This check is designed for optimization shortcut so it must be
* efficient.
- * Instead of is_zero(), use is_unallocated() as it is faster (but not
- * as accurate and can result in false negatives).
+ * Instead of is_zero(), use bdrv_is_unallocated_or_zero_above() as
+ * it is faster (but not as accurate and can result in false negatives).
*/
- return is_unallocated(bs, m->offset + m->cow_start.offset,
- m->cow_start.nb_bytes) &&
- is_unallocated(bs, m->offset + m->cow_end.offset,
- m->cow_end.nb_bytes);
+ int ret;
+
+ ret = bdrv_is_unallocated_or_zero_above(bs, m->offset + m->cow_start.offset,
+ m->cow_start.nb_bytes);
+ if (ret <= 0) {
+ return ret;
+ }
+
+ return bdrv_is_unallocated_or_zero_above(bs, m->offset + m->cow_end.offset,
+ m->cow_end.nb_bytes);
}
static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
@@ -2436,7 +2438,10 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
continue;
}
- if (!is_zero_cow(bs, m)) {
+ ret = is_zero_cow(bs, m);
+ if (ret < 0) {
+ return ret;
+ } else if (ret == 0) {
continue;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster
2020-08-27 14:53 [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
@ 2020-09-09 12:52 ` Alberto Garcia
2020-09-09 19:23 ` Vladimir Sementsov-Ogievskiy
1 sibling, 0 replies; 4+ messages in thread
From: Alberto Garcia @ 2020-09-09 12:52 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Max Reitz
ping
On Thu 27 Aug 2020 04:53:50 PM CEST, Alberto Garcia wrote:
> Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
> request results in a new allocation QEMU first tries to see if the
> rest of the cluster outside the written area contains only zeroes.
>
> In that case, instead of doing a normal copy-on-write operation and
> writing explicit zero buffers to disk, the code zeroes the whole
> cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
>
> This improves performance very significantly but it only happens when
> we are writing to an area that was completely unallocated before. Zero
> clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
> are therefore slower to allocate.
>
> This happens because the code uses bdrv_is_allocated_above() rather
> bdrv_block_status_above(). The former is not as accurate for this
> purpose but it is faster. However in the case of qcow2 the underlying
> call does already report zero clusters just fine so there is no reason
> why we cannot use that information.
>
> After testing 4KB writes on an image that only contains zero clusters
> this patch results in almost five times more IOPS.
Berto
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster
2020-08-27 14:53 [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
2020-09-09 12:52 ` Alberto Garcia
@ 2020-09-09 19:23 ` Vladimir Sementsov-Ogievskiy
2020-09-10 12:04 ` Kevin Wolf
1 sibling, 1 reply; 4+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-09-09 19:23 UTC (permalink / raw)
To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz
27.08.2020 17:53, Alberto Garcia wrote:
> Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
> request results in a new allocation QEMU first tries to see if the
> rest of the cluster outside the written area contains only zeroes.
>
> In that case, instead of doing a normal copy-on-write operation and
> writing explicit zero buffers to disk, the code zeroes the whole
> cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
>
> This improves performance very significantly but it only happens when
> we are writing to an area that was completely unallocated before. Zero
> clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
> are therefore slower to allocate.
>
> This happens because the code uses bdrv_is_allocated_above() rather
> bdrv_block_status_above(). The former is not as accurate for this
> purpose but it is faster. However in the case of qcow2 the underlying
> call does already report zero clusters just fine so there is no reason
> why we cannot use that information.
>
> After testing 4KB writes on an image that only contains zero clusters
> this patch results in almost five times more IOPS.
>
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
> v2:
> - Add new, simpler API: bdrv_is_unallocated_or_zero_above()
>
> include/block/block.h | 2 ++
> block/io.c | 24 ++++++++++++++++++++++++
> block/qcow2.c | 37 +++++++++++++++++++++----------------
> 3 files changed, 47 insertions(+), 16 deletions(-)
>
> diff --git a/include/block/block.h b/include/block/block.h
> index 6e36154061..477a72b2e9 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -496,6 +496,8 @@ int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
> int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
> bool include_base, int64_t offset, int64_t bytes,
> int64_t *pnum);
> +int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
> + int64_t bytes);
>
> bool bdrv_is_read_only(BlockDriverState *bs);
> int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
> diff --git a/block/io.c b/block/io.c
> index ad3a51ed53..94faa3f9d7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2557,6 +2557,30 @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
> offset, bytes, pnum, map, file);
> }
>
> +/*
> + * Check @bs (and its backing chain) to see if the range defined
> + * by @offset and @bytes is unallocated or known to read as zeroes.
> + * Return 1 if that is the case, 0 otherwise and -errno on error.
> + * This test is meant to be fast rather than accurate so returning 0
> + * does not guarantee non-zero data.
> + */
> +int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
> + int64_t bytes)
_above prefix for block-status functions usually assume existing of "base"
parameter, otherwise, it's not clear "above what?"
Also, actually the caller doesn't care about it it allocated or not. It only wants to know is it read-as-zero or not. So, probably better name is bdrv_is_zero_fast()
> +{
> + int ret;
> + int64_t pnum = bytes;
> +
> + ret = bdrv_common_block_status_above(bs, NULL, false, offset,
> + bytes, &pnum, NULL, NULL);
> +
> + if (ret < 0) {
> + return ret;
> + }
> +
> + return (pnum == bytes) &&
> + ((ret & BDRV_BLOCK_ZERO) || (!(ret & BDRV_BLOCK_ALLOCATED)));
Note that some protocol drivers returns unallocated status when it doesn't read-as-zero, so in general, we can't use this function as is_zero.
I think, that better to keep only (pnum == bytes) && (ret & BDRV_BLOCK_ZERO) here, and to make it work correctly improve bdrv_co_block_status like this:
diff --git a/block/io.c b/block/io.c
index ad3a51ed53..33b2e91bcd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2408,15 +2408,15 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
ret |= BDRV_BLOCK_ALLOCATED;
- } else if (want_zero && bs->drv->supports_backing) {
- if (bs->backing) {
+ } else if (bs->drv->supports_backing) {
+ if (bs->backing && want_zero) {
BlockDriverState *bs2 = bs->backing->bs;
int64_t size2 = bdrv_getlength(bs2);
if (size2 >= 0 && offset >= size2) {
ret |= BDRV_BLOCK_ZERO;
}
- } else {
+ } else if (!bs->backing) {
ret |= BDRV_BLOCK_ZERO;
}
}
- we can always add ZERO flag to backing-supporting formats if range is unallocated and there is no backing file.
> +}
> +
> int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
> int64_t bytes, int64_t *pnum)
> {
> diff --git a/block/qcow2.c b/block/qcow2.c
> index da56b1a4df..29bea548db 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2391,26 +2391,28 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
> return false;
> }
>
> -static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> -{
> - int64_t nr;
> - return !bytes ||
> - (!bdrv_is_allocated_above(bs, NULL, false, offset, bytes, &nr) &&
> - nr == bytes);
> -}
> -
> -static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> +/*
> + * Return 1 if the COW regions read as zeroes, 0 if not, < 0 on error.
> + * Note that returning 0 does not guarantee non-zero data.
> + */
> +static int is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> {
> /*
> * This check is designed for optimization shortcut so it must be
> * efficient.
> - * Instead of is_zero(), use is_unallocated() as it is faster (but not
> - * as accurate and can result in false negatives).
> + * Instead of is_zero(), use bdrv_is_unallocated_or_zero_above() as
> + * it is faster (but not as accurate and can result in false negatives).
> */
> - return is_unallocated(bs, m->offset + m->cow_start.offset,
> - m->cow_start.nb_bytes) &&
> - is_unallocated(bs, m->offset + m->cow_end.offset,
> - m->cow_end.nb_bytes);
> + int ret;
> +
> + ret = bdrv_is_unallocated_or_zero_above(bs, m->offset + m->cow_start.offset,
> + m->cow_start.nb_bytes);
> + if (ret <= 0) {
> + return ret;
> + }
> +
> + return bdrv_is_unallocated_or_zero_above(bs, m->offset + m->cow_end.offset,
> + m->cow_end.nb_bytes);
> }
>
> static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> @@ -2436,7 +2438,10 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> continue;
> }
>
> - if (!is_zero_cow(bs, m)) {
> + ret = is_zero_cow(bs, m);
> + if (ret < 0) {
> + return ret;
> + } else if (ret == 0) {
> continue;
> }
>
>
--
Best regards,
Vladimir
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster
2020-09-09 19:23 ` Vladimir Sementsov-Ogievskiy
@ 2020-09-10 12:04 ` Kevin Wolf
0 siblings, 0 replies; 4+ messages in thread
From: Kevin Wolf @ 2020-09-10 12:04 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz
Am 09.09.2020 um 21:23 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 27.08.2020 17:53, Alberto Garcia wrote:
> > Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
> > request results in a new allocation QEMU first tries to see if the
> > rest of the cluster outside the written area contains only zeroes.
> >
> > In that case, instead of doing a normal copy-on-write operation and
> > writing explicit zero buffers to disk, the code zeroes the whole
> > cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
> >
> > This improves performance very significantly but it only happens when
> > we are writing to an area that was completely unallocated before. Zero
> > clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
> > are therefore slower to allocate.
> >
> > This happens because the code uses bdrv_is_allocated_above() rather
> > bdrv_block_status_above(). The former is not as accurate for this
> > purpose but it is faster. However in the case of qcow2 the underlying
> > call does already report zero clusters just fine so there is no reason
> > why we cannot use that information.
> >
> > After testing 4KB writes on an image that only contains zero clusters
> > this patch results in almost five times more IOPS.
> >
> > Signed-off-by: Alberto Garcia <berto@igalia.com>
> > ---
> > v2:
> > - Add new, simpler API: bdrv_is_unallocated_or_zero_above()
> >
> > include/block/block.h | 2 ++
> > block/io.c | 24 ++++++++++++++++++++++++
> > block/qcow2.c | 37 +++++++++++++++++++++----------------
> > 3 files changed, 47 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/block/block.h b/include/block/block.h
> > index 6e36154061..477a72b2e9 100644
> > --- a/include/block/block.h
> > +++ b/include/block/block.h
> > @@ -496,6 +496,8 @@ int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
> > bool include_base, int64_t offset, int64_t bytes,
> > int64_t *pnum);
> > +int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
> > + int64_t bytes);
> > bool bdrv_is_read_only(BlockDriverState *bs);
> > int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
> > diff --git a/block/io.c b/block/io.c
> > index ad3a51ed53..94faa3f9d7 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -2557,6 +2557,30 @@ int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > offset, bytes, pnum, map, file);
> > }
> > +/*
> > + * Check @bs (and its backing chain) to see if the range defined
> > + * by @offset and @bytes is unallocated or known to read as zeroes.
> > + * Return 1 if that is the case, 0 otherwise and -errno on error.
> > + * This test is meant to be fast rather than accurate so returning 0
> > + * does not guarantee non-zero data.
> > + */
> > +int bdrv_is_unallocated_or_zero_above(BlockDriverState *bs, int64_t offset,
> > + int64_t bytes)
>
> _above prefix for block-status functions usually assume existing of "base"
> parameter, otherwise, it's not clear "above what?"
>
> Also, actually the caller doesn't care about it it allocated or not. It only wants to know is it read-as-zero or not. So, probably better name is bdrv_is_zero_fast()
>
> > +{
> > + int ret;
> > + int64_t pnum = bytes;
> > +
> > + ret = bdrv_common_block_status_above(bs, NULL, false, offset,
> > + bytes, &pnum, NULL, NULL);
> > +
> > + if (ret < 0) {
> > + return ret;
> > + }
> > +
> > + return (pnum == bytes) &&
> > + ((ret & BDRV_BLOCK_ZERO) || (!(ret & BDRV_BLOCK_ALLOCATED)));
>
> Note that some protocol drivers returns unallocated status when it doesn't read-as-zero, so in general, we can't use this function as is_zero.
>
> I think, that better to keep only (pnum == bytes) && (ret & BDRV_BLOCK_ZERO) here
Ah, very good, you already mentioned the main points I had, and more. (I
didn't realise that using BDRV_BLOCK_ALLOCATED is actually wrong, just
that it's more complicated than necessary.)
What I would like to add is that a bdrv_co_is_zero_fast() would be even
better. is_zero_cow() isn't marked as such yet, but semantically it's a
coroutine_fn, so we have no reason to go through the synchronous
wrappers.
> , and to make it work correctly improve bdrv_co_block_status like this:
>
> diff --git a/block/io.c b/block/io.c
> index ad3a51ed53..33b2e91bcd 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2408,15 +2408,15 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
> ret |= BDRV_BLOCK_ALLOCATED;
> - } else if (want_zero && bs->drv->supports_backing) {
> - if (bs->backing) {
> + } else if (bs->drv->supports_backing) {
> + if (bs->backing && want_zero) {
> BlockDriverState *bs2 = bs->backing->bs;
> int64_t size2 = bdrv_getlength(bs2);
> if (size2 >= 0 && offset >= size2) {
> ret |= BDRV_BLOCK_ZERO;
> }
> - } else {
> + } else if (!bs->backing) {
> ret |= BDRV_BLOCK_ZERO;
> }
> }
>
> - we can always add ZERO flag to backing-supporting formats if range is unallocated and there is no backing file.
This makes sense to me, though it should be a separate patch. This one
wouldn't become incorrect without it, but it would be less effective.
Kevin
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-09-10 12:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-27 14:53 [PATCH v2] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
2020-09-09 12:52 ` Alberto Garcia
2020-09-09 19:23 ` Vladimir Sementsov-Ogievskiy
2020-09-10 12:04 ` Kevin Wolf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).