* [PATCH 0/7] qcow2: make subclusters discardable
@ 2023-10-20 21:56 Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 1/7] qcow2: make function update_refcount_discard() global Andrey Drobyshev
` (6 more replies)
0 siblings, 7 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
This series extends the discard operation in the qcow2 driver so that it
can be used to discard individual subclusters. This can reclaim host disk
space when handling guest TRIM/discard requests, especially with large
clusters (1M, 2M) when subclusters are enabled.
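To illustrate the idea with a toy model (not code from this series; the
real helpers live in block/qcow2.h): in an extended L2 entry, bits 0-31 of
the 64-bit bitmap track subcluster allocation and bits 32-63 track the
"reads as zeroes" status, so a subcluster discard boils down to clearing
allocation bits and, for non-full discards, setting the zero bits of the
affected range before punching a hole in the host file:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of an extended L2 bitmap: bits 0-31 = allocation status,
     * bits 32-63 = "reads as zeroes" status of the 32 subclusters. */
    static uint64_t sub_alloc_range(unsigned from, unsigned to)
    {
        return (1ULL << to) - (1ULL << from);   /* bits [from, to) */
    }

    static uint64_t sub_zero_range(unsigned from, unsigned to)
    {
        return sub_alloc_range(from, to) << 32;
    }

    int main(void)
    {
        uint64_t l2_bitmap = 0x00000000ffffffffULL; /* fully allocated */

        /* guest discards subclusters 0..15, e.g. the first 32k of a
         * 64k cluster */
        l2_bitmap &= ~sub_alloc_range(0, 16);   /* deallocate them */
        l2_bitmap |= sub_zero_range(0, 16);     /* they read as zeroes now */

        /* prints 0x0000ffffffff0000: subclusters 16-31 still allocated,
         * 0-15 read back as zeroes */
        printf("new L2 bitmap: 0x%016llx\n", (unsigned long long)l2_bitmap);
        return 0;
    }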
Andrey Drobyshev (7):
qcow2: make function update_refcount_discard() global
qcow2: add get_sc_range_info() helper for working with subcluster
ranges
qcow2: zeroize the entire cluster when there're no non-zero
subclusters
qcow2: make subclusters discardable
qcow2: zero_l2_subclusters: fall through to discard operation when
requested
iotests/common.rc: add disk_usage function
iotests/271: check disk usage on subcluster-based discard/unmap
block/qcow2-cluster.c | 217 +++++++++++++++++++++++++++++------
block/qcow2-refcount.c | 8 +-
block/qcow2.c | 8 +-
block/qcow2.h | 2 +
tests/qemu-iotests/250 | 5 -
tests/qemu-iotests/271 | 25 +++-
tests/qemu-iotests/271.out | 2 +
tests/qemu-iotests/common.rc | 6 +
8 files changed, 226 insertions(+), 47 deletions(-)
--
2.39.3
* [PATCH 1/7] qcow2: make function update_refcount_discard() global
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-10-31 15:27 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges Andrey Drobyshev
` (5 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
We are going to need it for discarding separate subclusters. The
function itself doesn't do anything with the refcount tables; it simply
adds a discard request to the queue, so rename it to qcow2_queue_discard().
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block/qcow2-refcount.c | 8 ++++----
block/qcow2.h | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 0266542cee..2026f5fa21 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -754,8 +754,8 @@ void qcow2_process_discards(BlockDriverState *bs, int ret)
}
}
-static void update_refcount_discard(BlockDriverState *bs,
- uint64_t offset, uint64_t length)
+void qcow2_queue_discard(BlockDriverState *bs, uint64_t offset,
+ uint64_t length)
{
BDRVQcow2State *s = bs->opaque;
Qcow2DiscardRegion *d, *p, *next;
@@ -902,7 +902,7 @@ update_refcount(BlockDriverState *bs, int64_t offset, int64_t length,
}
if (s->discard_passthrough[type]) {
- update_refcount_discard(bs, cluster_offset, s->cluster_size);
+ qcow2_queue_discard(bs, cluster_offset, s->cluster_size);
}
}
}
@@ -3619,7 +3619,7 @@ qcow2_discard_refcount_block(BlockDriverState *bs, uint64_t discard_block_offs)
/* discard refblock from the cache if refblock is cached */
qcow2_cache_discard(s->refcount_block_cache, refblock);
}
- update_refcount_discard(bs, discard_block_offs, s->cluster_size);
+ qcow2_queue_discard(bs, discard_block_offs, s->cluster_size);
return 0;
}
diff --git a/block/qcow2.h b/block/qcow2.h
index 29958c512b..75d6a1b61b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -891,6 +891,8 @@ int coroutine_fn qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *re
BdrvCheckMode fix);
void qcow2_process_discards(BlockDriverState *bs, int ret);
+void qcow2_queue_discard(BlockDriverState *bs, uint64_t offset,
+ uint64_t length);
int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
int64_t size);
--
2.39.3
* [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 1/7] qcow2: make function update_refcount_discard() global Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-10-31 15:53 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters Andrey Drobyshev
` (4 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
This helper simply obtains the L2 table parameters of the cluster which
contains the given subcluster range. Right now this info is obtained and
used by zero_l2_subclusters(). As we're about to introduce a subcluster
discard operation, this helper lets us avoid code duplication.
Also introduce struct SubClusterRangeInfo, which holds all the needed
parameters.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block/qcow2-cluster.c | 90 +++++++++++++++++++++++++++++--------------
1 file changed, 62 insertions(+), 28 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 904f00d1b3..8801856b93 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,13 @@
#include "qemu/memalign.h"
#include "trace.h"
+typedef struct SubClusterRangeInfo {
+ uint64_t *l2_slice;
+ int l2_index;
+ uint64_t l2_entry;
+ uint64_t l2_bitmap;
+} SubClusterRangeInfo;
+
int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs,
uint64_t exact_size)
{
@@ -1892,6 +1899,50 @@ again:
return 0;
}
+static int get_sc_range_info(BlockDriverState *bs, uint64_t offset,
+ unsigned nb_subclusters,
+ SubClusterRangeInfo *scri)
+{
+ BDRVQcow2State *s = bs->opaque;
+ int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset);
+ QCow2SubclusterType sctype;
+
+ /* Here we only work with the subclusters within single cluster. */
+ assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
+ assert(sc_index + nb_subclusters <= s->subclusters_per_cluster);
+ assert(offset_into_subcluster(s, offset) == 0);
+
+ ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index);
+ if (ret < 0) {
+ return ret;
+ }
+
+ scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index);
+ scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index);
+
+ do {
+ qcow2_get_subcluster_range_type(bs, scri->l2_entry, scri->l2_bitmap,
+ sc_index, &sctype);
+ if (ret < 0) {
+ return ret;
+ }
+
+ switch (sctype) {
+ case QCOW2_SUBCLUSTER_COMPRESSED:
+ /* We cannot partially zeroize/discard compressed clusters. */
+ return -ENOTSUP;
+ case QCOW2_SUBCLUSTER_INVALID:
+ return -EINVAL;
+ default:
+ break;
+ }
+
+ sc_cleared += ret;
+ } while (sc_cleared < nb_subclusters);
+
+ return 0;
+}
+
/*
* This discards as many clusters of nb_clusters as possible at once (i.e.
* all clusters in the same L2 slice) and returns the number of discarded
@@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
unsigned nb_subclusters)
{
BDRVQcow2State *s = bs->opaque;
- uint64_t *l2_slice;
- uint64_t old_l2_bitmap, l2_bitmap;
- int l2_index, ret, sc = offset_to_sc_index(s, offset);
+ uint64_t new_l2_bitmap;
+ int ret, sc = offset_to_sc_index(s, offset);
+ SubClusterRangeInfo scri = { 0 };
- /* For full clusters use zero_in_l2_slice() instead */
- assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
- assert(sc + nb_subclusters <= s->subclusters_per_cluster);
- assert(offset_into_subcluster(s, offset) == 0);
-
- ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
+ ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
if (ret < 0) {
- return ret;
- }
-
- switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) {
- case QCOW2_CLUSTER_COMPRESSED:
- ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */
goto out;
- case QCOW2_CLUSTER_NORMAL:
- case QCOW2_CLUSTER_UNALLOCATED:
- break;
- default:
- g_assert_not_reached();
}
- old_l2_bitmap = l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
-
- l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
- l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+ new_l2_bitmap = scri.l2_bitmap;
+ new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+ new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
- if (old_l2_bitmap != l2_bitmap) {
- set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);
- qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
+ if (new_l2_bitmap != scri.l2_bitmap) {
+ set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
+ qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
}
ret = 0;
out:
- qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
+ qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
return ret;
}
--
2.39.3
* [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 1/7] qcow2: make function update_refcount_discard() global Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-10-31 16:06 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
` (3 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
When zeroizing the last non-zero subclusters within a single cluster, it
makes sense to zeroize the entire cluster and take the zero_in_l2_slice()
path right away. That way we'd also update the corresponding refcount
table.
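For illustration only, the condition being added, as a standalone sketch
(the real code uses the QCOW_L2_BITMAP_ALL_ZEROES mask from block/qcow2.h;
here it is spelled out):

    #include <stdbool.h>
    #include <stdint.h>

    /* Bits 32-63 of an extended L2 bitmap are the per-subcluster
     * "reads as zeroes" flags; once all 32 of them are set, the whole
     * cluster can go down the zero_in_l2_slice() path, which also
     * updates the refcount table. */
    static bool cluster_reads_as_zeroes(uint64_t l2_bitmap)
    {
        const uint64_t all_zero_mask = 0xffffffffULL << 32;

        return (l2_bitmap & all_zero_mask) == all_zero_mask;
    }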
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block/qcow2-cluster.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8801856b93..7c6fa5524c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2145,7 +2145,7 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
static int coroutine_fn GRAPH_RDLOCK
zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
- unsigned nb_subclusters)
+ unsigned nb_subclusters, int flags)
{
BDRVQcow2State *s = bs->opaque;
uint64_t new_l2_bitmap;
@@ -2161,6 +2161,17 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+ /*
+ * If there're no non-zero subclusters left, we might as well zeroize
+ * the entire cluster. That way we'd also update the refcount table.
+ */
+ if ((new_l2_bitmap & QCOW_L2_BITMAP_ALL_ZEROES) ==
+ QCOW_L2_BITMAP_ALL_ZEROES) {
+ qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
+ return zero_in_l2_slice(bs, QEMU_ALIGN_DOWN(offset, s->cluster_size),
+ 1, flags);
+ }
+
if (new_l2_bitmap != scri.l2_bitmap) {
set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
@@ -2221,7 +2232,7 @@ int coroutine_fn qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
if (head) {
ret = zero_l2_subclusters(bs, offset - head,
- size_to_subclusters(s, head));
+ size_to_subclusters(s, head), flags);
if (ret < 0) {
goto fail;
}
@@ -2242,7 +2253,8 @@ int coroutine_fn qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
}
if (tail) {
- ret = zero_l2_subclusters(bs, end_offset, size_to_subclusters(s, tail));
+ ret = zero_l2_subclusters(bs, end_offset,
+ size_to_subclusters(s, tail), flags);
if (ret < 0) {
goto fail;
}
--
2.39.3
* [PATCH 4/7] qcow2: make subclusters discardable
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
` (2 preceding siblings ...)
2023-10-20 21:56 ` [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-10-27 11:10 ` Jean-Louis Dupond
` (3 more replies)
2023-10-20 21:56 ` [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested Andrey Drobyshev
` (2 subsequent siblings)
6 siblings, 4 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
This commit makes the discard operation work at the subcluster level
rather than the cluster level. It introduces the discard_l2_subclusters()
function and uses it in the qcow2 discard implementation, much like
it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
changes the qcow2 driver's pdiscard_alignment to subcluster_size. That
way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE)
operation and free host disk space.
This feature lets us reclaim additional host disk space on guest
TRIM/discard requests, especially when using large enough clusters
(1M, 2M) with subclusters enabled.
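As a worked example of the request splitting added to
qcow2_cluster_discard() (standalone arithmetic with sample sizes only,
ignoring the special case at the image end):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    #define ROUND_UP(x, a)   (((x) + (a) - 1) / (a) * (a))
    #define ROUND_DOWN(x, a) ((x) / (a) * (a))

    int main(void)
    {
        const uint64_t cluster_size = 64 * 1024;  /* 32 subclusters of 2k */
        uint64_t offset = 96 * 1024, end_offset = 416 * 1024;

        /* subcluster-granular "head" up to the first cluster boundary */
        uint64_t head = MIN(end_offset, ROUND_UP(offset, cluster_size)) - offset;
        offset += head;

        /* subcluster-granular "tail" after the last cluster boundary */
        uint64_t tail = end_offset -
                        MAX(offset, ROUND_DOWN(end_offset, cluster_size));
        end_offset -= tail;

        /* head: 32768, whole clusters: [131072, 393216), tail: 32768
         * -> head/tail go through discard_l2_subclusters(), the middle
         *    through discard_in_l2_slice() as before */
        printf("head: %" PRIu64 ", clusters: [%" PRIu64 ", %" PRIu64 "), "
               "tail: %" PRIu64 "\n", head, offset, end_offset, tail);
        return 0;
    }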
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
block/qcow2.c | 8 ++--
2 files changed, 101 insertions(+), 7 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7c6fa5524c..cf40f2dc12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
return nb_clusters;
}
+static int coroutine_fn GRAPH_RDLOCK
+discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+ uint64_t nb_subclusters,
+ enum qcow2_discard_type type,
+ bool full_discard,
+ SubClusterRangeInfo *pscri)
+{
+ BDRVQcow2State *s = bs->opaque;
+ uint64_t new_l2_bitmap, l2_bitmap_mask;
+ int ret, sc = offset_to_sc_index(s, offset);
+ SubClusterRangeInfo scri = { 0 };
+
+ if (!pscri) {
+ ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
+ if (ret < 0) {
+ goto out;
+ }
+ } else {
+ scri = *pscri;
+ }
+
+ l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+ new_l2_bitmap = scri.l2_bitmap;
+ new_l2_bitmap &= ~l2_bitmap_mask;
+
+ /*
+ * If there're no allocated subclusters left, we might as well discard
+ * the entire cluster. That way we'd also update the refcount table.
+ */
+ if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
+ return discard_in_l2_slice(bs,
+ QEMU_ALIGN_DOWN(offset, s->cluster_size),
+ 1, type, full_discard);
+ }
+
+ /*
+ * Full discard means we fall through to the backing file, thus we only
+ * need to mark the subclusters as deallocated.
+ *
+ * Non-full discard means subclusters should be explicitly marked as
+ * zeroes. In this case QCOW2 specification requires the corresponding
+ * allocation status bits to be unset as well. If the subclusters are
+ * deallocated in the first place and there's no backing, the operation
+ * can be skipped.
+ */
+ if (!full_discard &&
+ (bs->backing || scri.l2_bitmap & l2_bitmap_mask)) {
+ new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+ }
+
+ if (scri.l2_bitmap != new_l2_bitmap) {
+ set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
+ qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
+ }
+
+ if (s->discard_passthrough[type]) {
+ qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
+ offset_into_cluster(s, offset),
+ nb_subclusters * s->subcluster_size);
+ }
+
+ ret = 0;
+out:
+ qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
+
+ return ret;
+}
+
int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
uint64_t bytes, enum qcow2_discard_type type,
bool full_discard)
@@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
BDRVQcow2State *s = bs->opaque;
uint64_t end_offset = offset + bytes;
uint64_t nb_clusters;
+ unsigned head, tail;
int64_t cleared;
int ret;
/* Caller must pass aligned values, except at image end */
- assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
- assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
+ assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
+ assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
- nb_clusters = size_to_clusters(s, bytes);
+ head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
+ offset += head;
+
+ tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
+ end_offset - MAX(offset, start_of_cluster(s, end_offset));
+ end_offset -= tail;
s->cache_discards = true;
+ if (head) {
+ ret = discard_l2_subclusters(bs, offset - head,
+ size_to_subclusters(s, head), type,
+ full_discard, NULL);
+ if (ret < 0) {
+ goto fail;
+ }
+ }
+
/* Each L2 slice is handled by its own loop iteration */
+ nb_clusters = size_to_clusters(s, end_offset - offset);
+
while (nb_clusters > 0) {
cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
full_discard);
@@ -2074,6 +2159,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
offset += (cleared * s->cluster_size);
}
+ if (tail) {
+ ret = discard_l2_subclusters(bs, end_offset,
+ size_to_subclusters(s, tail), type,
+ full_discard, NULL);
+ if (ret < 0) {
+ goto fail;
+ }
+ }
+
ret = 0;
fail:
s->cache_discards = false;
diff --git a/block/qcow2.c b/block/qcow2.c
index aa01d9e7b5..66961fa59e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1966,7 +1966,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
bs->bl.request_alignment = qcrypto_block_get_sector_size(s->crypto);
}
bs->bl.pwrite_zeroes_alignment = s->subcluster_size;
- bs->bl.pdiscard_alignment = s->cluster_size;
+ bs->bl.pdiscard_alignment = s->subcluster_size;
}
static int GRAPH_UNLOCKED
@@ -4102,11 +4102,11 @@ qcow2_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
return -ENOTSUP;
}
- if (!QEMU_IS_ALIGNED(offset | bytes, s->cluster_size)) {
- assert(bytes < s->cluster_size);
+ if (!QEMU_IS_ALIGNED(offset | bytes, bs->bl.pdiscard_alignment)) {
+ assert(bytes < bs->bl.pdiscard_alignment);
/* Ignore partial clusters, except for the special case of the
* complete partial cluster at the end of an unaligned file */
- if (!QEMU_IS_ALIGNED(offset, s->cluster_size) ||
+ if (!QEMU_IS_ALIGNED(offset, bs->bl.pdiscard_alignment) ||
offset + bytes != bs->total_sectors * BDRV_SECTOR_SIZE) {
return -ENOTSUP;
}
--
2.39.3
* [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
` (3 preceding siblings ...)
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-11-03 15:19 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 6/7] iotests/common.rc: add disk_usage function Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap Andrey Drobyshev
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
When zeroizing subclusters within a single cluster, detect use of the
BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
operation, much like it's done with cluster-based discards. That way
subcluster-aligned operations such as "qemu-io -c 'write -z -u ...'" will
lead to an actual unmap.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block/qcow2-cluster.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index cf40f2dc12..040251f2c3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
unsigned nb_subclusters, int flags)
{
BDRVQcow2State *s = bs->opaque;
- uint64_t new_l2_bitmap;
+ uint64_t new_l2_bitmap, l2_bitmap_mask;
int ret, sc = offset_to_sc_index(s, offset);
SubClusterRangeInfo scri = { 0 };
@@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
goto out;
}
+ l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
new_l2_bitmap = scri.l2_bitmap;
- new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
- new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+ new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+ new_l2_bitmap &= ~l2_bitmap_mask;
/*
* If there're no non-zero subclusters left, we might as well zeroize
@@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
1, flags);
}
+ /*
+ * If the request allows discarding subclusters and they're actually
+ * allocated, we go down the discard path since after the discard
+ * operation the subclusters are going to be read as zeroes anyway.
+ */
+ if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) {
+ return discard_l2_subclusters(bs, offset, nb_subclusters,
+ QCOW2_DISCARD_REQUEST, false, &scri);
+ }
+
if (new_l2_bitmap != scri.l2_bitmap) {
set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
--
2.39.3
* [PATCH 6/7] iotests/common.rc: add disk_usage function
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
` (4 preceding siblings ...)
2023-10-20 21:56 ` [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-11-03 15:20 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap Andrey Drobyshev
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
Move the definition from iotests/250 to common.rc. This is used to
detect real disk usage of sparse files. In particular, we want to use
it for checking subcluster-based discards.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
tests/qemu-iotests/250 | 5 -----
tests/qemu-iotests/common.rc | 6 ++++++
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
index af48f83aba..c0a0dbc0ff 100755
--- a/tests/qemu-iotests/250
+++ b/tests/qemu-iotests/250
@@ -52,11 +52,6 @@ _unsupported_imgopts data_file
# bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed
# anyway.
-disk_usage()
-{
- du --block-size=1 $1 | awk '{print $1}'
-}
-
size=2100M
_make_test_img -o "cluster_size=1M,preallocation=metadata" $size
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 95c12577dd..5d2ea26c7f 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -140,6 +140,12 @@ _optstr_add()
fi
}
+# report real disk usage for sparse files
+disk_usage()
+{
+ du --block-size=1 $1 | awk '{print $1}'
+}
+
# Set the variables to the empty string to turn Valgrind off
# for specific processes, e.g.
# $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015
--
2.39.3
* [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
` (5 preceding siblings ...)
2023-10-20 21:56 ` [PATCH 6/7] iotests/common.rc: add disk_usage function Andrey Drobyshev
@ 2023-10-20 21:56 ` Andrey Drobyshev
2023-11-03 15:51 ` Hanna Czenczek
6 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2023-10-20 21:56 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, andrey.drobyshev, den
Add a _verify_du_delta() checker which verifies that the real disk usage
delta meets expectations. For now we use it to check that subcluster-based
discard/unmap operations lead to an actual decrease in disk usage (i.e. a
PUNCH_HOLE operation is performed).
Also add a separate test case for discarding particular subclusters within
one cluster.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
tests/qemu-iotests/271 | 25 ++++++++++++++++++++++++-
tests/qemu-iotests/271.out | 2 ++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index c7c2cadda0..5fcb209f5f 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -81,6 +81,15 @@ _verify_l2_bitmap()
fi
}
+# Check disk usage delta after a discard/unmap operation
+# _verify_du_delta $before $after $expected_delta
+_verify_du_delta()
+{
+ if [ $(($1 - $2)) -ne $3 ]; then
+ printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n"
+ fi
+}
+
# This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
# c: cluster number (0 if unset)
# sc: subcluster number inside cluster @c (0 if unset)
@@ -198,9 +207,12 @@ for use_backing_file in yes no; do
alloc="$(seq 0 31)"; zero=""
_run_test sc=0 len=64k
- ### Zero and unmap half of cluster #0 (this won't unmap it)
+ ### Zero and unmap half of cluster #0 (this will unmap it)
alloc="$(seq 16 31)"; zero="$(seq 0 15)"
+ before=$(disk_usage "$TEST_IMG")
_run_test sc=0 len=32k cmd=unmap
+ after=$(disk_usage "$TEST_IMG")
+ _verify_du_delta $before $after 32768
### Zero and unmap cluster #0
alloc=""; zero="$(seq 0 31)"
@@ -447,7 +459,10 @@ for use_backing_file in yes no; do
# Subcluster-aligned request from clusters #12 to #14
alloc="$(seq 0 15)"; zero="$(seq 16 31)"
+ before=$(disk_usage "$TEST_IMG")
_run_test c=12 sc=16 len=128k cmd=unmap
+ after=$(disk_usage "$TEST_IMG")
+ _verify_du_delta $before $after $((128 * 1024))
alloc=""; zero="$(seq 0 31)"
_verify_l2_bitmap 13
alloc="$(seq 16 31)"; zero="$(seq 0 15)"
@@ -528,6 +543,14 @@ for use_backing_file in yes no; do
else
_make_test_img -o extended_l2=on 1M
fi
+ # Write cluster #0 and discard its subclusters #0-#3
+ $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
+ before=$(disk_usage "$TEST_IMG")
+ $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
+ after=$(disk_usage "$TEST_IMG")
+ _verify_du_delta $before $after 8192
+ alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+ _verify_l2_bitmap 0
# Write clusters #0-#2 and then discard them
$QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
$QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index 5be780de76..0da8d72cde 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -426,6 +426,7 @@ L2 entry #29: 0x0000000000000000 0000ffff00000000
### Discarding clusters with non-zero bitmaps (backing file: yes) ###
Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw
+L2 entry #0: 0x8000000000050000 0000000ffffffff0
L2 entry #0: 0x0000000000000000 ffffffff00000000
L2 entry #1: 0x0000000000000000 ffffffff00000000
Image resized.
@@ -436,6 +437,7 @@ L2 entry #1: 0x0000000000000000 ffffffff00000000
### Discarding clusters with non-zero bitmaps (backing file: no) ###
Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+L2 entry #0: 0x8000000000050000 0000000ffffffff0
L2 entry #0: 0x0000000000000000 ffffffff00000000
L2 entry #1: 0x0000000000000000 ffffffff00000000
Image resized.
--
2.39.3
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
@ 2023-10-27 11:10 ` Jean-Louis Dupond
2024-04-16 19:56 ` Andrey Drobyshev
2023-10-31 16:32 ` Hanna Czenczek
` (2 subsequent siblings)
3 siblings, 1 reply; 28+ messages in thread
From: Jean-Louis Dupond @ 2023-10-27 11:10 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, den
On 20/10/2023 23:56, Andrey Drobyshev wrote:
> This commit makes the discard operation work on the subcluster level
> rather than cluster level. It introduces discard_l2_subclusters()
> function and makes use of it in qcow2 discard implementation, much like
> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
> operation and free host disk space.
>
> This feature will let us gain additional disk space on guest
> TRIM/discard requests, especially when using large enough clusters
> (1M, 2M) with subclusters enabled.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
> block/qcow2.c | 8 ++--
> 2 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 7c6fa5524c..cf40f2dc12 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
> return nb_clusters;
> }
>
> +static int coroutine_fn GRAPH_RDLOCK
> +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> + uint64_t nb_subclusters,
> + enum qcow2_discard_type type,
> + bool full_discard,
> + SubClusterRangeInfo *pscri)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + uint64_t new_l2_bitmap, l2_bitmap_mask;
> + int ret, sc = offset_to_sc_index(s, offset);
> + SubClusterRangeInfo scri = { 0 };
> +
> + if (!pscri) {
> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
> + if (ret < 0) {
> + goto out;
> + }
> + } else {
> + scri = *pscri;
> + }
> +
> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap = scri.l2_bitmap;
> + new_l2_bitmap &= ~l2_bitmap_mask;
> +
> + /*
> + * If there're no allocated subclusters left, we might as well discard
> + * the entire cluster. That way we'd also update the refcount table.
> + */
> + if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
> + return discard_in_l2_slice(bs,
> + QEMU_ALIGN_DOWN(offset, s->cluster_size),
> + 1, type, full_discard);
> + }
> +
> + /*
> + * Full discard means we fall through to the backing file, thus we only
> + * need to mark the subclusters as deallocated.
> + *
> + * Non-full discard means subclusters should be explicitly marked as
> + * zeroes. In this case QCOW2 specification requires the corresponding
> + * allocation status bits to be unset as well. If the subclusters are
> + * deallocated in the first place and there's no backing, the operation
> + * can be skipped.
> + */
> + if (!full_discard &&
> + (bs->backing || scri.l2_bitmap & l2_bitmap_mask)) {
> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> + }
> +
> + if (scri.l2_bitmap != new_l2_bitmap) {
> + set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
> + qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
> + }
> +
> + if (s->discard_passthrough[type]) {
> + qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
> + offset_into_cluster(s, offset),
> + nb_subclusters * s->subcluster_size);
> + }
> +
> + ret = 0;
> +out:
> + qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
> +
> + return ret;
> +}
> +
> int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> uint64_t bytes, enum qcow2_discard_type type,
> bool full_discard)
> @@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> BDRVQcow2State *s = bs->opaque;
> uint64_t end_offset = offset + bytes;
> uint64_t nb_clusters;
> + unsigned head, tail;
> int64_t cleared;
> int ret;
>
> /* Caller must pass aligned values, except at image end */
> - assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
> - assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
> + assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
> + assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
> end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
>
> - nb_clusters = size_to_clusters(s, bytes);
> + head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
> + offset += head;
> +
> + tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
> + end_offset - MAX(offset, start_of_cluster(s, end_offset));
> + end_offset -= tail;
>
> s->cache_discards = true;
>
> + if (head) {
> + ret = discard_l2_subclusters(bs, offset - head,
> + size_to_subclusters(s, head), type,
> + full_discard, NULL);
> + if (ret < 0) {
> + goto fail;
> + }
> + }
> +
> /* Each L2 slice is handled by its own loop iteration */
> + nb_clusters = size_to_clusters(s, end_offset - offset);
> +
> while (nb_clusters > 0) {
> cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
> full_discard);
> @@ -2074,6 +2159,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> offset += (cleared * s->cluster_size);
> }
>
> + if (tail) {
> + ret = discard_l2_subclusters(bs, end_offset,
> + size_to_subclusters(s, tail), type,
> + full_discard, NULL);
> + if (ret < 0) {
> + goto fail;
> + }
> + }
> +
> ret = 0;
> fail:
> s->cache_discards = false;
> diff --git a/block/qcow2.c b/block/qcow2.c
> index aa01d9e7b5..66961fa59e 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1966,7 +1966,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
> bs->bl.request_alignment = qcrypto_block_get_sector_size(s->crypto);
> }
> bs->bl.pwrite_zeroes_alignment = s->subcluster_size;
> - bs->bl.pdiscard_alignment = s->cluster_size;
> + bs->bl.pdiscard_alignment = s->subcluster_size;
> }
>
> static int GRAPH_UNLOCKED
> @@ -4102,11 +4102,11 @@ qcow2_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
> return -ENOTSUP;
> }
>
> - if (!QEMU_IS_ALIGNED(offset | bytes, s->cluster_size)) {
> - assert(bytes < s->cluster_size);
> + if (!QEMU_IS_ALIGNED(offset | bytes, bs->bl.pdiscard_alignment)) {
> + assert(bytes < bs->bl.pdiscard_alignment);
> /* Ignore partial clusters, except for the special case of the
> * complete partial cluster at the end of an unaligned file */
> - if (!QEMU_IS_ALIGNED(offset, s->cluster_size) ||
> + if (!QEMU_IS_ALIGNED(offset, bs->bl.pdiscard_alignment) ||
> offset + bytes != bs->total_sectors * BDRV_SECTOR_SIZE) {
> return -ENOTSUP;
> }
I've checked all the code paths, and as far as I can see, none of them
breaks the discard_no_unref option.
It's important that we don't introduce new code paths that can make
holes in the qcow2 image when this option is enabled :)
If you can confirm my conclusion, that would be great.
Thanks
Jean-Louis
* Re: [PATCH 1/7] qcow2: make function update_refcount_discard() global
2023-10-20 21:56 ` [PATCH 1/7] qcow2: make function update_refcount_discard() global Andrey Drobyshev
@ 2023-10-31 15:27 ` Hanna Czenczek
0 siblings, 0 replies; 28+ messages in thread
From: Hanna Czenczek @ 2023-10-31 15:27 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> We are going to need it for discarding separate subclusters. The
> function itself doesn't do anything with the refcount tables,
I think the idea behind the naming was that updating refcounts can lead
to clusters being discarded, i.e. update_refcount_discard() did the
“discard” part of “update_refcount”.
> it simply
> adds a discard request to the queue, so rename it to qcow2_queue_discard().
But that’s fine, too.
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-refcount.c | 8 ++++----
> block/qcow2.h | 2 ++
> 2 files changed, 6 insertions(+), 4 deletions(-)
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
* Re: [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges
2023-10-20 21:56 ` [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges Andrey Drobyshev
@ 2023-10-31 15:53 ` Hanna Czenczek
2023-11-09 12:32 ` Andrey Drobyshev
0 siblings, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-10-31 15:53 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> This helper simply obtains the l2 table parameters of the cluster which
> contains the given subclusters range. Right now this info is being
> obtained and used by zero_l2_subclusters(). As we're about to introduce
> the subclusters discard operation, this helper would let us avoid code
> duplication.
>
> Also introduce struct SubClusterRangeInfo, which would contain all the
> needed params.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 90 +++++++++++++++++++++++++++++--------------
> 1 file changed, 62 insertions(+), 28 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 904f00d1b3..8801856b93 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -32,6 +32,13 @@
> #include "qemu/memalign.h"
> #include "trace.h"
>
> +typedef struct SubClusterRangeInfo {
> + uint64_t *l2_slice;
We should document that this is a strong reference that must be returned
via qcow2_cache_put(). Maybe you could also define a clean-up function
using G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() that does this, allowing this
type to be used with g_auto().
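Roughly something like the following (untested, and it assumes the struct
grows a pointer to the L2 table cache so the cleanup function can return
the slice on its own):

    typedef struct SubClusterRangeInfo {
        uint64_t *l2_slice;
        int l2_index;
        uint64_t l2_entry;
        uint64_t l2_bitmap;
        Qcow2Cache *l2_table_cache; /* hypothetical extra field */
    } SubClusterRangeInfo;

    static void sc_range_info_clear(SubClusterRangeInfo *scri)
    {
        if (scri->l2_slice) {
            qcow2_cache_put(scri->l2_table_cache, (void **) &scri->l2_slice);
        }
    }
    G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo, sc_range_info_clear)

Callers could then declare the variable with g_auto(SubClusterRangeInfo)
and drop the explicit qcow2_cache_put() in their out: paths.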
> + int l2_index;
> + uint64_t l2_entry;
> + uint64_t l2_bitmap;
> +} SubClusterRangeInfo;
> +
> int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs,
> uint64_t exact_size)
> {
> @@ -1892,6 +1899,50 @@ again:
> return 0;
> }
>
> +static int get_sc_range_info(BlockDriverState *bs, uint64_t offset,
> + unsigned nb_subclusters,
> + SubClusterRangeInfo *scri)
It would be good to have documentation for this function, for example
that it only works on a single cluster, i.e. that the range denoted by
@offset and @nb_subclusters must not cross a cluster boundary, and that
@offset must be aligned to subclusters.
In general, it is unclear to me at this point what this function does.
OK, it gets the SCRI for all subclusters in the cluster at @offset (this
is what its name implies), but then it also has this loop that checks
whether there are compressed clusters among the @nb_subclusters. It has
a comment about being unable to zero/discard subclusters in compressed
clusters, but the function name says nothing about this scope of
zeroing/discarding.
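Something along these lines as a starting point (wording is just a
suggestion):

    /*
     * get_sc_range_info():
     *
     * Collect the L2 slice, index, entry and bitmap covering the
     * subcluster range [@offset, @offset + @nb_subclusters * subcluster
     * size) into @scri. The range must not cross a cluster boundary and
     * @offset must be aligned to a subcluster.
     *
     * Returns -ENOTSUP if the range touches a compressed cluster (which
     * cannot be partially zeroed/discarded), -EINVAL if any subcluster
     * is invalid, and 0 on success, in which case the caller owns a
     * reference to scri->l2_slice and must return it with
     * qcow2_cache_put().
     */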
> +{
> + BDRVQcow2State *s = bs->opaque;
> + int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset);
> + QCow2SubclusterType sctype;
> +
> + /* Here we only work with the subclusters within single cluster. */
> + assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
> + assert(sc_index + nb_subclusters <= s->subclusters_per_cluster);
> + assert(offset_into_subcluster(s, offset) == 0);
> +
> + ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index);
> + scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index);
> +
> + do {
> + qcow2_get_subcluster_range_type(bs, scri->l2_entry, scri->l2_bitmap,
> + sc_index, &sctype);
I think there’s a `ret = ` missing here.
> + if (ret < 0) {
> + return ret;
> + }
> +
> + switch (sctype) {
> + case QCOW2_SUBCLUSTER_COMPRESSED:
> + /* We cannot partially zeroize/discard compressed clusters. */
> + return -ENOTSUP;
> + case QCOW2_SUBCLUSTER_INVALID:
> + return -EINVAL;
> + default:
> + break;
> + }
> +
> + sc_cleared += ret;
> + } while (sc_cleared < nb_subclusters);
> +
> + return 0;
> +}
> +
> /*
> * This discards as many clusters of nb_clusters as possible at once (i.e.
> * all clusters in the same L2 slice) and returns the number of discarded
> @@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> unsigned nb_subclusters)
> {
> BDRVQcow2State *s = bs->opaque;
> - uint64_t *l2_slice;
> - uint64_t old_l2_bitmap, l2_bitmap;
> - int l2_index, ret, sc = offset_to_sc_index(s, offset);
> + uint64_t new_l2_bitmap;
> + int ret, sc = offset_to_sc_index(s, offset);
> + SubClusterRangeInfo scri = { 0 };
>
> - /* For full clusters use zero_in_l2_slice() instead */
> - assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
> - assert(sc + nb_subclusters <= s->subclusters_per_cluster);
> - assert(offset_into_subcluster(s, offset) == 0);
> -
> - ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
> if (ret < 0) {
> - return ret;
> - }
> -
> - switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) {
> - case QCOW2_CLUSTER_COMPRESSED:
> - ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */
> goto out;
> - case QCOW2_CLUSTER_NORMAL:
> - case QCOW2_CLUSTER_UNALLOCATED:
> - break;
> - default:
> - g_assert_not_reached();
> }
>
> - old_l2_bitmap = l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
> -
> - l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> - l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap = scri.l2_bitmap;
> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
>
> - if (old_l2_bitmap != l2_bitmap) {
> - set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);
> - qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> + if (new_l2_bitmap != scri.l2_bitmap) {
> + set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
> + qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
> }
>
> ret = 0;
> out:
> - qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
> + qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
qcow2_cache_put() doesn’t look like it’s safe to be called if the table
is NULL (i.e. `scri.l2_slice == NULL`). We can get here if
get_sc_range_info() fails, so that may happen. I think you should either:
(A) Implement G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() for the SCRI type so it
isn’t an issue (at least I assume that could solve it), or
(B) Call qcow2_cache_put() here only if `scri.l2_slice != NULL`.
In any case, I think it also makes sense to have get_sc_range_info()
only return a valid table if it returns success, i.e. if there’s any
error in get_sc_range_info(), it should clean up `scri.l2_slice` itself.
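For (B), that would just be something like:

    out:
        if (scri.l2_slice) {
            qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
        }

        return ret;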
Hanna
>
> return ret;
> }
* Re: [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters
2023-10-20 21:56 ` [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters Andrey Drobyshev
@ 2023-10-31 16:06 ` Hanna Czenczek
0 siblings, 0 replies; 28+ messages in thread
From: Hanna Czenczek @ 2023-10-31 16:06 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> When zeroizing the last non-zero subclusters within single cluster, it
> makes sense to go zeroize the entire cluster and go down zero_in_l2_slice()
> path right away. That way we'd also update the corresponding refcount
> table.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-27 11:10 ` Jean-Louis Dupond
@ 2023-10-31 16:32 ` Hanna Czenczek
2023-11-09 15:05 ` Andrey Drobyshev
2023-10-31 16:33 ` Hanna Czenczek
2023-11-03 15:53 ` Hanna Czenczek
3 siblings, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-10-31 16:32 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> This commit makes the discard operation work on the subcluster level
> rather than cluster level. It introduces discard_l2_subclusters()
> function and makes use of it in qcow2 discard implementation, much like
> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
> operation and free host disk space.
>
> This feature will let us gain additional disk space on guest
> TRIM/discard requests, especially when using large enough clusters
> (1M, 2M) with subclusters enabled.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
> block/qcow2.c | 8 ++--
> 2 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 7c6fa5524c..cf40f2dc12 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
[...]
> + if (scri.l2_bitmap != new_l2_bitmap) {
> + set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
> + qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
> + }
> +
> + if (s->discard_passthrough[type]) {
> + qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
> + offset_into_cluster(s, offset),
> + nb_subclusters * s->subcluster_size);
Are we sure that the cluster is allocated, i.e. that scri.l2_entry &
L2E_OFFSET_MASK != 0?
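If not, a guard along these lines might be needed (sketch):

    if (s->discard_passthrough[type] &&
        (scri.l2_entry & L2E_OFFSET_MASK) != 0) {
        qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
                            offset_into_cluster(s, offset),
                            nb_subclusters * s->subcluster_size);
    }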
As a side note, I guess discard_in_l2_slice() should also use
qcow2_queue_discard() instead of bdrv_pdiscard() then.
> + }
> +
> + ret = 0;
> +out:
> + qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
> +
> + return ret;
> +}
> +
> int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> uint64_t bytes, enum qcow2_discard_type type,
> bool full_discard)
> @@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> BDRVQcow2State *s = bs->opaque;
> uint64_t end_offset = offset + bytes;
> uint64_t nb_clusters;
> + unsigned head, tail;
> int64_t cleared;
> int ret;
>
> /* Caller must pass aligned values, except at image end */
> - assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
> - assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
> + assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
> + assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
> end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
>
> - nb_clusters = size_to_clusters(s, bytes);
> + head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
> + offset += head;
> +
> + tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
> + end_offset - MAX(offset, start_of_cluster(s, end_offset));
> + end_offset -= tail;
>
> s->cache_discards = true;
>
> + if (head) {
> + ret = discard_l2_subclusters(bs, offset - head,
> + size_to_subclusters(s, head), type,
> + full_discard, NULL);
> + if (ret < 0) {
> + goto fail;
> + }
> + }
> +
> /* Each L2 slice is handled by its own loop iteration */
> + nb_clusters = size_to_clusters(s, end_offset - offset);
> +
> while (nb_clusters > 0) {
I think the comment should stay attached to the `while`.
Hanna
> cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
> full_discard);
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-27 11:10 ` Jean-Louis Dupond
2023-10-31 16:32 ` Hanna Czenczek
@ 2023-10-31 16:33 ` Hanna Czenczek
2023-11-10 13:26 ` Andrey Drobyshev
2023-11-03 15:53 ` Hanna Czenczek
3 siblings, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-10-31 16:33 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
(Sorry, opened another reply window, forgot I already had one open...)
On 20.10.23 23:56, Andrey Drobyshev wrote:
> This commit makes the discard operation work on the subcluster level
> rather than cluster level. It introduces discard_l2_subclusters()
> function and makes use of it in qcow2 discard implementation, much like
> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
> operation and free host disk space.
>
> This feature will let us gain additional disk space on guest
> TRIM/discard requests, especially when using large enough clusters
> (1M, 2M) with subclusters enabled.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
> block/qcow2.c | 8 ++--
> 2 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 7c6fa5524c..cf40f2dc12 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
> return nb_clusters;
> }
>
> +static int coroutine_fn GRAPH_RDLOCK
> +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> + uint64_t nb_subclusters,
> + enum qcow2_discard_type type,
> + bool full_discard,
> + SubClusterRangeInfo *pscri)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + uint64_t new_l2_bitmap, l2_bitmap_mask;
> + int ret, sc = offset_to_sc_index(s, offset);
> + SubClusterRangeInfo scri = { 0 };
> +
> + if (!pscri) {
> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
> + if (ret < 0) {
> + goto out;
> + }
> + } else {
> + scri = *pscri;
Allowing this to be passed in by the caller sounds dangerous, considering
we need to track who takes care of freeing scri.l2_slice.
> + }
> +
> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap = scri.l2_bitmap;
> + new_l2_bitmap &= ~l2_bitmap_mask;
> +
> + /*
> + * If there're no allocated subclusters left, we might as well discard
> + * the entire cluster. That way we'd also update the refcount table.
> + */
> + if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
What if there are subclusters in the cluster that are marked as zero,
outside of the discarded range? It sounds wrong to apply a discard to
them, whether full_discard is set or cleared.
> + return discard_in_l2_slice(bs,
> + QEMU_ALIGN_DOWN(offset, s->cluster_size),
> + 1, type, full_discard);
> + }
> +
> + /*
> + * Full discard means we fall through to the backing file, thus we only
> + * need to mark the subclusters as deallocated.
I think it also means we need to clear the zero bits.
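I.e. something like (sketch):

    if (full_discard) {
        new_l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
    }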
Hanna
> + *
> + * Non-full discard means subclusters should be explicitly marked as
> + * zeroes. In this case QCOW2 specification requires the corresponding
> + * allocation status bits to be unset as well. If the subclusters are
> + * deallocated in the first place and there's no backing, the operation
> + * can be skipped.
> + */
> + if (!full_discard &&
> + (bs->backing || scri.l2_bitmap & l2_bitmap_mask)) {
> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> + }
* Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested
2023-10-20 21:56 ` [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested Andrey Drobyshev
@ 2023-11-03 15:19 ` Hanna Czenczek
2023-11-10 13:17 ` Andrey Drobyshev
0 siblings, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-11-03 15:19 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> When zeroizing subclusters within single cluster, detect usage of the
> BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
> operation, much like it's done with the cluster-based discards. That
> way subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will
> lead to actual unmap.
Ever since the introduction of discard-no-unref, I wonder whether that
is actually an advantage or not. I can’t think of a reason why it’d be
advantageous to drop the allocation.
On the other hand, zero_in_l2_slice() does do this, so consistency is
a reasonable argument.
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index cf40f2dc12..040251f2c3 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> unsigned nb_subclusters, int flags)
> {
> BDRVQcow2State *s = bs->opaque;
> - uint64_t new_l2_bitmap;
> + uint64_t new_l2_bitmap, l2_bitmap_mask;
> int ret, sc = offset_to_sc_index(s, offset);
> SubClusterRangeInfo scri = { 0 };
>
> @@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> goto out;
> }
>
> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
“l2_bitmap_mask” already wasn’t a great name in patch 4 because it
doesn’t say what mask it is, but in patch 4, there was at least only one
relevant mask. Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the
name should reflect what kind of mask it is.
> new_l2_bitmap = scri.l2_bitmap;
> - new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> - new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap &= ~l2_bitmap_mask;
>
> /*
> * If there're no non-zero subclusters left, we might as well zeroize
> @@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> 1, flags);
> }
>
> + /*
> + * If the request allows discarding subclusters and they're actually
> + * allocated, we go down the discard path since after the discard
> + * operation the subclusters are going to be read as zeroes anyway.
> + */
> + if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) {
> + return discard_l2_subclusters(bs, offset, nb_subclusters,
> + QCOW2_DISCARD_REQUEST, false, &scri);
> + }
I wonder why it matters whether any subclusters are allocated, i.e. why
can’t we just immediately use discard_l2_subclusters() whenever
BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also
save us passing SCRI here, which I’ve already said on patch 4, I don’t
find great).
Doesn’t discard_l2_subclusters() guarantee the clusters read as zero
when full_discard=false? There is this case where it won’t mark the
subclusters as zero if there is no backing file, and none of the
subclusters is allocated. But still, (A) those subclusters will read as
zero, and (B) if there is a problem there, why don’t we just always mark
those subclusters as zero? This optimization only has effect when the
guest discards/zeroes subclusters (not whole clusters) that were already
discarded, so it sounds minuscule.
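I.e., couldn't zero_l2_subclusters() simply start with something like this
(sketch), before fetching SCRI at all:

    if (flags & BDRV_REQ_MAY_UNMAP) {
        return discard_l2_subclusters(bs, offset, nb_subclusters,
                                      QCOW2_DISCARD_REQUEST, false, NULL);
    }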
Hanna
> +
> if (new_l2_bitmap != scri.l2_bitmap) {
> set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
> qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
* Re: [PATCH 6/7] iotests/common.rc: add disk_usage function
2023-10-20 21:56 ` [PATCH 6/7] iotests/common.rc: add disk_usage function Andrey Drobyshev
@ 2023-11-03 15:20 ` Hanna Czenczek
2023-11-09 12:35 ` Andrey Drobyshev
0 siblings, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-11-03 15:20 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> Move the definition from iotests/250 to common.rc. This is used to
> detect real disk usage of sparse files. In particular, we want to use
> it for checking subclusters-based discards.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> tests/qemu-iotests/250 | 5 -----
> tests/qemu-iotests/common.rc | 6 ++++++
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
> index af48f83aba..c0a0dbc0ff 100755
> --- a/tests/qemu-iotests/250
> +++ b/tests/qemu-iotests/250
> @@ -52,11 +52,6 @@ _unsupported_imgopts data_file
> # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed
> # anyway.
>
> -disk_usage()
> -{
> - du --block-size=1 $1 | awk '{print $1}'
> -}
> -
> size=2100M
>
> _make_test_img -o "cluster_size=1M,preallocation=metadata" $size
> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> index 95c12577dd..5d2ea26c7f 100644
> --- a/tests/qemu-iotests/common.rc
> +++ b/tests/qemu-iotests/common.rc
> @@ -140,6 +140,12 @@ _optstr_add()
> fi
> }
>
> +# report real disk usage for sparse files
> +disk_usage()
> +{
> + du --block-size=1 $1 | awk '{print $1}'
Pre-existing, but since you’re touching this now: Can you please change
the $1 to "$1"?
Hanna
> +}
> +
> # Set the variables to the empty string to turn Valgrind off
> # for specific processes, e.g.
> # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
2023-10-20 21:56 ` [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap Andrey Drobyshev
@ 2023-11-03 15:51 ` Hanna Czenczek
2023-11-03 15:59 ` Hanna Czenczek
2023-11-09 13:55 ` Andrey Drobyshev
0 siblings, 2 replies; 28+ messages in thread
From: Hanna Czenczek @ 2023-11-03 15:51 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> Add _verify_du_delta() checker which is used to check that real disk
> usage delta meets the expectations. For now we use it for checking that
> subcluster-based discard/unmap operations lead to actual disk usage
> decrease (i.e. PUNCH_HOLE operation is performed).
I’m not too happy about checking the disk usage because that relies on
the underlying filesystem actually accepting and executing the unmap.
Why is it not enough to check the L2 bitmap?
…Coming back later (I had to fix the missing `ret = ` I mentioned in
patch 2, or this test would hang, so I couldn’t run it at first), I note
that checking the disk usage in fact doesn’t work on tmpfs. I usually
run the iotests in tmpfs, so that’s not great.
> Also add separate test case for discarding particular subcluster within
> one cluster.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> tests/qemu-iotests/271 | 25 ++++++++++++++++++++++++-
> tests/qemu-iotests/271.out | 2 ++
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
> index c7c2cadda0..5fcb209f5f 100755
> --- a/tests/qemu-iotests/271
> +++ b/tests/qemu-iotests/271
> @@ -81,6 +81,15 @@ _verify_l2_bitmap()
> fi
> }
>
> +# Check disk usage delta after a discard/unmap operation
> +# _verify_du_delta $before $after $expected_delta
> +_verify_du_delta()
> +{
> + if [ $(($1 - $2)) -ne $3 ]; then
> + printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n"
> + fi
> +}
> +
> # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
> # c: cluster number (0 if unset)
> # sc: subcluster number inside cluster @c (0 if unset)
> @@ -198,9 +207,12 @@ for use_backing_file in yes no; do
> alloc="$(seq 0 31)"; zero=""
> _run_test sc=0 len=64k
>
> - ### Zero and unmap half of cluster #0 (this won't unmap it)
> + ### Zero and unmap half of cluster #0 (this will unmap it)
I think “it” refers to the cluster, and it is not unmapped. This test
case does not use a discard, but write -z instead, so it worked before.
(The L2 bitmap shown in the output doesn’t change, so functionally, this
patch series didn’t change this case.)
> alloc="$(seq 16 31)"; zero="$(seq 0 15)"
> + before=$(disk_usage "$TEST_IMG")
> _run_test sc=0 len=32k cmd=unmap
> + after=$(disk_usage "$TEST_IMG")
> + _verify_du_delta $before $after 32768
>
> ### Zero and unmap cluster #0
> alloc=""; zero="$(seq 0 31)"
For this following case shown truncated here, why don’t we try
“_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”? I.e.
unmap only the second half, which, thanks to patch 3, should still unmap
the whole cluster, because the first half is already unmapped.
> @@ -447,7 +459,10 @@ for use_backing_file in yes no; do
>
> # Subcluster-aligned request from clusters #12 to #14
> alloc="$(seq 0 15)"; zero="$(seq 16 31)"
> + before=$(disk_usage "$TEST_IMG")
> _run_test c=12 sc=16 len=128k cmd=unmap
> + after=$(disk_usage "$TEST_IMG")
> + _verify_du_delta $before $after $((128 * 1024))
> alloc=""; zero="$(seq 0 31)"
> _verify_l2_bitmap 13
> alloc="$(seq 16 31)"; zero="$(seq 0 15)"
> @@ -528,6 +543,14 @@ for use_backing_file in yes no; do
> else
> _make_test_img -o extended_l2=on 1M
> fi
> + # Write cluster #0 and discard its subclusters #0-#3
> + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
> + before=$(disk_usage "$TEST_IMG")
> + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
> + after=$(disk_usage "$TEST_IMG")
> + _verify_du_delta $before $after 8192
> + alloc="$(seq 4 31)"; zero="$(seq 0 3)"
> + _verify_l2_bitmap 0
> # Write clusters #0-#2 and then discard them
> $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
> $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
Similarly to above, I think it would be good if we combined this
following case with the one you added, i.e. to write 128k from the
beginning, drop the write here, and change the discard to be “discard -q
8k 120k”, i.e. skip the subclusters we have already discarded, to see
that this is still combined to discard the whole first cluster.
...Ah, see, and when I try this, the following assertion fails:
qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion
`c->entries[i].ref == 0' failed.
./common.rc: line 220: 128894 Aborted (core dumped) (
VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec
"${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )
Looks like an L2 table is leaked somewhere. That’s why SCRI should be a
g_auto()-able type.
Hanna
> diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
> index 5be780de76..0da8d72cde 100644
> --- a/tests/qemu-iotests/271.out
> +++ b/tests/qemu-iotests/271.out
> @@ -426,6 +426,7 @@ L2 entry #29: 0x0000000000000000 0000ffff00000000
> ### Discarding clusters with non-zero bitmaps (backing file: yes) ###
>
> Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw
> +L2 entry #0: 0x8000000000050000 0000000ffffffff0
> L2 entry #0: 0x0000000000000000 ffffffff00000000
> L2 entry #1: 0x0000000000000000 ffffffff00000000
> Image resized.
> @@ -436,6 +437,7 @@ L2 entry #1: 0x0000000000000000 ffffffff00000000
> ### Discarding clusters with non-zero bitmaps (backing file: no) ###
>
> Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
> +L2 entry #0: 0x8000000000050000 0000000ffffffff0
> L2 entry #0: 0x0000000000000000 ffffffff00000000
> L2 entry #1: 0x0000000000000000 ffffffff00000000
> Image resized.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
` (2 preceding siblings ...)
2023-10-31 16:33 ` Hanna Czenczek
@ 2023-11-03 15:53 ` Hanna Czenczek
3 siblings, 0 replies; 28+ messages in thread
From: Hanna Czenczek @ 2023-11-03 15:53 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 20.10.23 23:56, Andrey Drobyshev wrote:
> This commit makes the discard operation work on the subcluster level
> rather than cluster level. It introduces discard_l2_subclusters()
> function and makes use of it in qcow2 discard implementation, much like
> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
> operation and free host disk space.
>
> This feature will let us gain additional disk space on guest
> TRIM/discard requests, especially when using large enough clusters
> (1M, 2M) with subclusters enabled.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
> block/qcow2.c | 8 ++--
> 2 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 7c6fa5524c..cf40f2dc12 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
> return nb_clusters;
> }
>
> +static int coroutine_fn GRAPH_RDLOCK
> +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
> + uint64_t nb_subclusters,
> + enum qcow2_discard_type type,
> + bool full_discard,
> + SubClusterRangeInfo *pscri)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + uint64_t new_l2_bitmap, l2_bitmap_mask;
> + int ret, sc = offset_to_sc_index(s, offset);
> + SubClusterRangeInfo scri = { 0 };
> +
> + if (!pscri) {
> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
> + if (ret < 0) {
> + goto out;
> + }
> + } else {
> + scri = *pscri;
> + }
> +
> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
> + new_l2_bitmap = scri.l2_bitmap;
> + new_l2_bitmap &= ~l2_bitmap_mask;
> +
> + /*
> + * If there're no allocated subclusters left, we might as well discard
> + * the entire cluster. That way we'd also update the refcount table.
> + */
> + if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
> + return discard_in_l2_slice(bs,
> + QEMU_ALIGN_DOWN(offset, s->cluster_size),
> + 1, type, full_discard);
> + }
scri.l2_slice is leaked here.
Hanna
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
2023-11-03 15:51 ` Hanna Czenczek
@ 2023-11-03 15:59 ` Hanna Czenczek
2023-11-09 14:05 ` Andrey Drobyshev
2023-11-09 13:55 ` Andrey Drobyshev
1 sibling, 1 reply; 28+ messages in thread
From: Hanna Czenczek @ 2023-11-03 15:59 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 03.11.23 16:51, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
[...]
>> @@ -528,6 +543,14 @@ for use_backing_file in yes no; do
>> else
>> _make_test_img -o extended_l2=on 1M
>> fi
>> + # Write cluster #0 and discard its subclusters #0-#3
>> + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
>> + before=$(disk_usage "$TEST_IMG")
>> + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
>> + after=$(disk_usage "$TEST_IMG")
>> + _verify_du_delta $before $after 8192
>> + alloc="$(seq 4 31)"; zero="$(seq 0 3)"
>> + _verify_l2_bitmap 0
>> # Write clusters #0-#2 and then discard them
>> $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
>> $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
>
> Similarly to above, I think it would be good if we combined this
> following case with the one you added, i.e. to write 128k from the
> beginning, drop the write here, and change the discard to be “discard
> -q 8k 120k”, i.e. skip the subclusters we have already discarded, to
> see that this is still combined to discard the whole first cluster.
>
> ...Ah, see, and when I try this, the following assertion fails:
>
> qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion
> `c->entries[i].ref == 0' failed.
> ./common.rc: line 220: 128894 Aborted (core dumped) (
> VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec
> "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )
>
> Looks like an L2 table is leaked somewhere. That’s why SCRI should be
> a g_auto()-able type.
Forgot to add: This single test case here is the only place where we
test the added functionality. I think there should be more cases. It
doesn’t really make sense that 271 has so many cases for writing zeroes
but so few for discarding, now that discarding works on subclusters.
For most of them we should at least consider whether they can be run
for discard as well.
I didn’t want to push for such an extensive set of tests, but, well, now
it turned out I overlooked a bug in patch 4, and only found it because I
thought “this place might also make a nice test case for this series”.
Hanna
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges
2023-10-31 15:53 ` Hanna Czenczek
@ 2023-11-09 12:32 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-09 12:32 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
Hello Hanna,
Sorry for the delay and thanks for your thorough and detailed review.
On 10/31/23 17:53, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> This helper simply obtains the l2 table parameters of the cluster which
>> contains the given subclusters range. Right now this info is being
>> obtained and used by zero_l2_subclusters(). As we're about to introduce
>> the subclusters discard operation, this helper would let us avoid code
>> duplication.
>>
>> Also introduce struct SubClusterRangeInfo, which would contain all the
>> needed params.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> block/qcow2-cluster.c | 90 +++++++++++++++++++++++++++++--------------
>> 1 file changed, 62 insertions(+), 28 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 904f00d1b3..8801856b93 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -32,6 +32,13 @@
>> #include "qemu/memalign.h"
>> #include "trace.h"
>> +typedef struct SubClusterRangeInfo {
>> + uint64_t *l2_slice;
>
> We should document that this is a strong reference that must be returned
> via qcow2_cache_put(). Maybe you could also define a clean-up function
> using G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() that does this, allowing this
> type to be used with g_auto().
>
>> + int l2_index;
>> + uint64_t l2_entry;
>> + uint64_t l2_bitmap;
>> +} SubClusterRangeInfo;
>> +
>> int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs,
>> uint64_t exact_size)
>> {
>> @@ -1892,6 +1899,50 @@ again:
>> return 0;
>> }
>> +static int get_sc_range_info(BlockDriverState *bs, uint64_t offset,
>> + unsigned nb_subclusters,
>> + SubClusterRangeInfo *scri)
>
> It would be good to have documentation for this function, for example
> that it only works on a single cluster, i.e. that the range denoted by
> @offset and @nb_subclusters must not cross a cluster boundary, and that
> @offset must be aligned to subclusters.
>
I figured those restrictions could be derived from the asserts in the
function body itself, much like it's done in zero_l2_subclusters() and
friends. But of course I don't mind documenting this.
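Something along these lines is what I'd put there (just a draft of the
wording):

    /*
     * Fill @scri with the L2 slice/index, L2 entry and subcluster bitmap
     * of the cluster containing the given range of subclusters.
     *
     * The range denoted by @offset and @nb_subclusters must not cross a
     * cluster boundary, and @offset must be aligned to the subcluster
     * size. Compressed clusters are rejected with -ENOTSUP, invalid
     * subclusters with -EINVAL.
     *
     * On success the caller owns a reference to scri->l2_slice and must
     * return it with qcow2_cache_put().
     */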
> In general, it is unclear to me at this point what this function does.
The sole purpose of this function is to avoid the code duplication,
since both zero_l2_subclusters() (pre-existing) and
discard_l2_subclusters() (newly introduced) need to obtain the same info
about the subclusters range they're working with.
> OK, it gets the SCRI for all subclusters in the cluster at @offset (this
> is what its name implies), but then it also has this loop that checks
> whether there are compressed clusters among the @nb_subclusters.
Look at the pre-existing implementation of zero_l2_subclusters(): it
also checks that the subcluster range we're dealing with isn't contained
within a compressed cluster (otherwise there's no point zeroizing
individual subclusters). So the logic isn't changed here. The only
reason I decided to loop through the subclusters (instead of calling
qcow2_get_cluster_type() for the whole cluster) is so that I could
detect invalid subclusters and return -EINVAL right away.
> It has a comment about being unable to zero/discard subclusters in compressed
> clusters, but the function name says nothing about this scope of
> zeroing/discarding.
>
Maybe rename it then to stress that we're dealing with regular
subclusters only? get_normal_sc_range_info()?
>> +{
>> + BDRVQcow2State *s = bs->opaque;
>> + int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset);
>> + QCow2SubclusterType sctype;
>> +
>> + /* Here we only work with the subclusters within single cluster. */
>> + assert(nb_subclusters > 0 && nb_subclusters <
>> s->subclusters_per_cluster);
>> + assert(sc_index + nb_subclusters <= s->subclusters_per_cluster);
>> + assert(offset_into_subcluster(s, offset) == 0);
>> +
>> + ret = get_cluster_table(bs, offset, &scri->l2_slice,
>> &scri->l2_index);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index);
>> + scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index);
>> +
>> + do {
>> + qcow2_get_subcluster_range_type(bs, scri->l2_entry,
>> scri->l2_bitmap,
>> + sc_index, &sctype);
>
> I think there’s a `ret = ` missing here.
>
Of course there is, thanks for catching this.
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + switch (sctype) {
>> + case QCOW2_SUBCLUSTER_COMPRESSED:
>> + /* We cannot partially zeroize/discard compressed
>> clusters. */
>> + return -ENOTSUP;
>> + case QCOW2_SUBCLUSTER_INVALID:
>> + return -EINVAL;
>> + default:
>> + break;
>> + }
>> +
>> + sc_cleared += ret;
>> + } while (sc_cleared < nb_subclusters);
>> +
>> + return 0;
>> +}
>> +
>> /*
>> * This discards as many clusters of nb_clusters as possible at once
>> (i.e.
>> * all clusters in the same L2 slice) and returns the number of
>> discarded
>> @@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs,
>> uint64_t offset,
>> unsigned nb_subclusters)
>> {
>> BDRVQcow2State *s = bs->opaque;
>> - uint64_t *l2_slice;
>> - uint64_t old_l2_bitmap, l2_bitmap;
>> - int l2_index, ret, sc = offset_to_sc_index(s, offset);
>> + uint64_t new_l2_bitmap;
>> + int ret, sc = offset_to_sc_index(s, offset);
>> + SubClusterRangeInfo scri = { 0 };
>> - /* For full clusters use zero_in_l2_slice() instead */
>> - assert(nb_subclusters > 0 && nb_subclusters <
>> s->subclusters_per_cluster);
>> - assert(sc + nb_subclusters <= s->subclusters_per_cluster);
>> - assert(offset_into_subcluster(s, offset) == 0);
>> -
>> - ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
>> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
>> if (ret < 0) {
>> - return ret;
>> - }
>> -
>> - switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice,
>> l2_index))) {
>> - case QCOW2_CLUSTER_COMPRESSED:
>> - ret = -ENOTSUP; /* We cannot partially zeroize compressed
>> clusters */
>> goto out;
>> - case QCOW2_CLUSTER_NORMAL:
>> - case QCOW2_CLUSTER_UNALLOCATED:
>> - break;
>> - default:
>> - g_assert_not_reached();
>> }
>> - old_l2_bitmap = l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
>> -
>> - l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
>> - l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
>> + new_l2_bitmap = scri.l2_bitmap;
>> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc +
>> nb_subclusters);
>> + new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc +
>> nb_subclusters);
>> - if (old_l2_bitmap != l2_bitmap) {
>> - set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);
>> - qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>> + if (new_l2_bitmap != scri.l2_bitmap) {
>> + set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
>> + qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
>> }
>> ret = 0;
>> out:
>> - qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
>> + qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
>
> qcow2_cache_put() doesn’t look like it’s safe to be called if the table
> is NULL (i.e. `scri.l2_slice == NULL`). We can get here if
> get_sc_range_info() fails, so that may happen. I think you should either:
>
> (A) Implement G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() for the SCRI type so it
> isn’t an issue (at least I assume that could solve it), or
>
> (B) Call qcow2_cache_put() here only if `scri.l2_slice != NULL`.
>
Since in patch 5 I add the code path zero_l2_subclusters() ->
discard_l2_subclusters(), option (A) would only be valid if we manage to
skip SCRI param on this call. If I understand correctly, you're
suggesting adding a destructor for the SCRI type so that we call
qcow2_cache_put() in one place and one place only when this structure
goes out of the current scope. But even in this case we'd have to
perform the '!= NULL' check in that destructor; the only benefit is that
we do it in one place.
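For the record, a minimal sketch of what such a cleanup could look like
(assuming we'd also have to stash the cache pointer in SCRI so the
destructor can reach it; the names here are made up):

    typedef struct SubClusterRangeInfo {
        Qcow2Cache *l2_cache;   /* assumed extra field for the cleanup */
        uint64_t *l2_slice;
        int l2_index;
        uint64_t l2_entry;
        uint64_t l2_bitmap;
    } SubClusterRangeInfo;

    static void subcluster_range_info_clear(SubClusterRangeInfo *scri)
    {
        /* The '!= NULL' check lives here, and only here */
        if (scri->l2_slice) {
            qcow2_cache_put(scri->l2_cache, (void **) &scri->l2_slice);
        }
    }
    G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo,
                                     subcluster_range_info_clear)

    /* ...and in the callers: g_auto(SubClusterRangeInfo) scri = { 0 }; */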
> In any case, I think it also makes sense to have get_sc_range_info()
> only return a valid table if it returns success, i.e. if there’s any
> error in get_sc_range_info(), it should clean up `scri.l2_slice` itself.
>
Yes, this seems like a sane behaviour.
> Hanna
>
>> return ret;
>> }
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/7] iotests/common.rc: add disk_usage function
2023-11-03 15:20 ` Hanna Czenczek
@ 2023-11-09 12:35 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-09 12:35 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 11/3/23 17:20, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> Move the definition from iotests/250 to common.rc. This is used to
>> detect real disk usage of sparse files. In particular, we want to use
>> it for checking subclusters-based discards.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> tests/qemu-iotests/250 | 5 -----
>> tests/qemu-iotests/common.rc | 6 ++++++
>> 2 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
>> index af48f83aba..c0a0dbc0ff 100755
>> --- a/tests/qemu-iotests/250
>> +++ b/tests/qemu-iotests/250
>> @@ -52,11 +52,6 @@ _unsupported_imgopts data_file
>> # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which
>> might succeed
>> # anyway.
>> -disk_usage()
>> -{
>> - du --block-size=1 $1 | awk '{print $1}'
>> -}
>> -
>> size=2100M
>> _make_test_img -o "cluster_size=1M,preallocation=metadata" $size
>> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
>> index 95c12577dd..5d2ea26c7f 100644
>> --- a/tests/qemu-iotests/common.rc
>> +++ b/tests/qemu-iotests/common.rc
>> @@ -140,6 +140,12 @@ _optstr_add()
>> fi
>> }
>> +# report real disk usage for sparse files
>> +disk_usage()
>> +{
>> + du --block-size=1 $1 | awk '{print $1}'
>
> Pre-existing, but since you’re touching this now: Can you please change
> the $1 to "$1"?
>
Sure, will do in v2.
> Hanna
>
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
2023-11-03 15:51 ` Hanna Czenczek
2023-11-03 15:59 ` Hanna Czenczek
@ 2023-11-09 13:55 ` Andrey Drobyshev
1 sibling, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-09 13:55 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 11/3/23 17:51, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> Add _verify_du_delta() checker which is used to check that real disk
>> usage delta meets the expectations. For now we use it for checking that
>> subcluster-based discard/unmap operations lead to actual disk usage
>> decrease (i.e. PUNCH_HOLE operation is performed).
>
> I’m not too happy about checking the disk usage because that relies on
> the underlying filesystem actually accepting and executing the unmap.
> Why is it not enough to check the L2 bitmap?
>
> …Coming back later (I had to fix the missing `ret = ` I mentioned in
> patch 2, or this test would hang, so I couldn’t run it at first), I note
> that checking the disk usage in fact doesn’t work on tmpfs. I usually
> run the iotests in tmpfs, so that’s not great.
>
My original idea was to make sure that the PUNCH_HOLE operation did
indeed take place, i.e. there was an actual discard. For instance,
currently the discard operation initiated by qemu-io is called with the
QCOW2_DISCARD_REQUEST discard type, but if some other type is passed by
mistake, qcow2_queue_discard() won't be called, and though the
subclusters will be marked unallocated in L2, the data will still be
there. Not quite what we expect from a discard operation.
BTW checking the disk usage on tmpfs works on my machine:
> # cd /tmp; df -Th /tmp
> Filesystem Type Size Used Avail Use% Mounted on
> tmpfs tmpfs 32G 2.5M 32G 1% /tmp
> # BUILD=/root/src/qemu/master/build
> # $BUILD/qemu-img create -f qcow2 -o extended_l2=on img.qcow2 1M
> Formatting 'img.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=on compression_type=zlib size=1048576 lazy_refcounts=off refcount_bits=16
> # $BUILD/qemu-io -c 'write -q 0 128k' img.qcow2
> # du --block-size=1 img.qcow2
> 397312 img.qcow2
> # $BUILD/qemu-io -f qcow2 -c 'discard -q 0 8k' img.qcow2
> # du --block-size=1 img.qcow2
> 389120 img.qcow2
> # $BUILD/qemu-io -f qcow2 -c 'discard -q 8k 120k' img.qcow2
> # du --block-size=1 img.qcow2
> 266240 img.qcow2
I'm wondering what this might depend on, and whether we can overcome it.
>> Also add separate test case for discarding particular subcluster within
>> one cluster.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> tests/qemu-iotests/271 | 25 ++++++++++++++++++++++++-
>> tests/qemu-iotests/271.out | 2 ++
>> 2 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
>> index c7c2cadda0..5fcb209f5f 100755
>> --- a/tests/qemu-iotests/271
>> +++ b/tests/qemu-iotests/271
>> @@ -81,6 +81,15 @@ _verify_l2_bitmap()
>> fi
>> }
>> +# Check disk usage delta after a discard/unmap operation
>> +# _verify_du_delta $before $after $expected_delta
>> +_verify_du_delta()
>> +{
>> + if [ $(($1 - $2)) -ne $3 ]; then
>> + printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n"
>> + fi
>> +}
>> +
>> # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX
>> cmd=XXX
>> # c: cluster number (0 if unset)
>> # sc: subcluster number inside cluster @c (0 if unset)
>> @@ -198,9 +207,12 @@ for use_backing_file in yes no; do
>> alloc="$(seq 0 31)"; zero=""
>> _run_test sc=0 len=64k
>> - ### Zero and unmap half of cluster #0 (this won't unmap it)
>> + ### Zero and unmap half of cluster #0 (this will unmap it)
>
> I think “it” refers to the cluster, and it is not unmapped. This test
> case does not use a discard, but write -z instead, so it worked before.
> (The L2 bitmap shown in the output doesn’t change, so functionally, this
> patch series didn’t change this case.)
>
From the _run_test() implementation:
> # cmd: the command to pass to qemu-io, must be one of
> # write -> write
> # zero -> write -z
> # unmap -> write -z -u <-------------
> # compress -> write -c
> # discard -> discard
> _run_test()
So it actually uses 'write -z -u', and we end up with an actual unmap.
I agree that the l2 bitmap doesn't change; that's why I specifically
added a disk usage check to catch the changed functionality.
>> alloc="$(seq 16 31)"; zero="$(seq 0 15)"
>> + before=$(disk_usage "$TEST_IMG")
>> _run_test sc=0 len=32k cmd=unmap
>> + after=$(disk_usage "$TEST_IMG")
>> + _verify_du_delta $before $after 32768
>> ### Zero and unmap cluster #0
>> alloc=""; zero="$(seq 0 31)"
>
> For this following case shown truncated here, why don’t we try
> “_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”? I.e.
> unmap only the second half, which, thanks to patch 3, should still unmap
> the whole cluster, because the first half is already unmapped.
>
Agreed. And the interesting part is that here we'd be calling 'write -u
-z', thus following the zero_l2_subclusters() ->
discard_l2_subclusters() -> discard_in_l2_slice() path...
>> @@ -447,7 +459,10 @@ for use_backing_file in yes no; do
>> # Subcluster-aligned request from clusters #12 to #14
>> alloc="$(seq 0 15)"; zero="$(seq 16 31)"
>> + before=$(disk_usage "$TEST_IMG")
>> _run_test c=12 sc=16 len=128k cmd=unmap
>> + after=$(disk_usage "$TEST_IMG")
>> + _verify_du_delta $before $after $((128 * 1024))
>> alloc=""; zero="$(seq 0 31)"
>> _verify_l2_bitmap 13
>> alloc="$(seq 16 31)"; zero="$(seq 0 15)"
>> @@ -528,6 +543,14 @@ for use_backing_file in yes no; do
>> else
>> _make_test_img -o extended_l2=on 1M
>> fi
>> + # Write cluster #0 and discard its subclusters #0-#3
>> + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
>> + before=$(disk_usage "$TEST_IMG")
>> + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
>> + after=$(disk_usage "$TEST_IMG")
>> + _verify_du_delta $before $after 8192
>> + alloc="$(seq 4 31)"; zero="$(seq 0 3)"
>> + _verify_l2_bitmap 0
>> # Write clusters #0-#2 and then discard them
>> $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
>> $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
>
> Similarly to above, I think it would be good if we combined this
> following case with the one you added, i.e. to write 128k from the
> beginning, drop the write here, and change the discard to be “discard -q
> 8k 120k”, i.e. skip the subclusters we have already discarded, to see
> that this is still combined to discard the whole first cluster.
>
...and here we'd be calling discard directly, thus following the
discard_l2_subclusters() -> discard_in_l2_slice() path.
> ...Ah, see, and when I try this, the following assertion fails:
>
> qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion
> `c->entries[i].ref == 0' failed.
> ./common.rc: line 220: 128894 Aborted (core dumped) (
> VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec
> "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )
>
Yes, I should've added qcow2_cache_put() on the
discard_l2_subclusters() -> discard_in_l2_slice() path, same as I did
on the zero_l2_subclusters() -> zero_in_l2_slice() path. Thanks for
catching this.
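Roughly this on the fall-through path, I think (untested sketch):

    /*
     * Release the L2 slice before delegating to the cluster-level path,
     * so the cache entry is not leaked.
     */
    if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
        qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
        return discard_in_l2_slice(bs,
                                   QEMU_ALIGN_DOWN(offset, s->cluster_size),
                                   1, type, full_discard);
    }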
> Looks like an L2 table is leaked somewhere. That’s why SCRI should be a
> g_auto()-able type.
>
In this case this indeed makes sense, since when we extend the operation
from the subclusters range to the entire cluster, SCRI is no longer
needed. The only question with this approach is the
zero_l2_subclusters() -> discard_l2_subclusters() path.
> Hanna
>
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
2023-11-03 15:59 ` Hanna Czenczek
@ 2023-11-09 14:05 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-09 14:05 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 11/3/23 17:59, Hanna Czenczek wrote:
> On 03.11.23 16:51, Hanna Czenczek wrote:
>> On 20.10.23 23:56, Andrey Drobyshev wrote:
>
> [...]
>
>>> @@ -528,6 +543,14 @@ for use_backing_file in yes no; do
>>> else
>>> _make_test_img -o extended_l2=on 1M
>>> fi
>>> + # Write cluster #0 and discard its subclusters #0-#3
>>> + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
>>> + before=$(disk_usage "$TEST_IMG")
>>> + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
>>> + after=$(disk_usage "$TEST_IMG")
>>> + _verify_du_delta $before $after 8192
>>> + alloc="$(seq 4 31)"; zero="$(seq 0 3)"
>>> + _verify_l2_bitmap 0
>>> # Write clusters #0-#2 and then discard them
>>> $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
>>> $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
>>
>> Similarly to above, I think it would be good if we combined this
>> following case with the one you added, i.e. to write 128k from the
>> beginning, drop the write here, and change the discard to be “discard
>> -q 8k 120k”, i.e. skip the subclusters we have already discarded, to
>> see that this is still combined to discard the whole first cluster.
>>
>> ...Ah, see, and when I try this, the following assertion fails:
>>
>> qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion
>> `c->entries[i].ref == 0' failed.
>> ./common.rc: line 220: 128894 Aborted (core dumped) (
>> VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec
>> "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )
>>
>> Looks like an L2 table is leaked somewhere. That’s why SCRI should be
>> a g_auto()-able type.
>
> Forgot to add: This single test case here is the only place where we
> test the added functionality. I think there should be more cases. It
> doesn’t really make sense now that 271 has so many cases for writing
> zeroes, but so few for discarding, now that discarding works on
> subclusters. Most of them should at least be considered whether we can
> run them for discard as well.
>
> I didn’t want to push for such an extensive set of tests, but, well, now
> it turned out I overlooked a bug in patch 4, and only found it because I
> thought “this place might also make a nice test case for this series”.
>
> Hanna
>
I agree that more coverage should be added. Based on the previous
email, I see the following cases:
1. Direct 'discard' on the subclusters range (discard_l2_subclusters()).
2. Direct 'discard' on the subclusters range, complementary to an
unallocated range (i.e. discard_l2_subclusters() -> discard_in_l2_slice()).
3. 'write -u -z' on the subclusters range (zero_l2_subclusters() ->
discard_l2_subclusters()).
4. 'write -u -z' on the subclusters range, complementary to an
unallocated range (zero_l2_subclusters() -> discard_l2_subclusters() ->
discard_in_l2_slice()).
It would also be nice to test the zero_l2_subclusters() ->
zero_in_l2_slice() path, but we'd have to somehow check the refcount
table for that, since the L2 bitmap doesn't change.
Please let me know if you can think of anything else.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-31 16:32 ` Hanna Czenczek
@ 2023-11-09 15:05 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-09 15:05 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 10/31/23 18:32, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> This commit makes the discard operation work on the subcluster level
>> rather than cluster level. It introduces discard_l2_subclusters()
>> function and makes use of it in qcow2 discard implementation, much like
>> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
>> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
>> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
>> operation and free host disk space.
>>
>> This feature will let us gain additional disk space on guest
>> TRIM/discard requests, especially when using large enough clusters
>> (1M, 2M) with subclusters enabled.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
>> block/qcow2.c | 8 ++--
>> 2 files changed, 101 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 7c6fa5524c..cf40f2dc12 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>
> [...]
>
>> + if (scri.l2_bitmap != new_l2_bitmap) {
>> + set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
>> + qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
>> + }
>> +
>> + if (s->discard_passthrough[type]) {
>> + qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
>> + offset_into_cluster(s, offset),
>> + nb_subclusters * s->subcluster_size);
>
> Are we sure that the cluster is allocated, i.e. that scri.l2_entry &
> L2E_OFFSET_MASK != 0?
>
I think it must be illegal to mark the entire cluster as unallocated and
yet mark some of the subclusters as allocated. So in the case you
describe we should detect it earlier in the '!(new_l2_bitmap &
QCOW_L2_BITMAP_ALL_ALLOC)' condition and fall back to discard_in_l2_slice().
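To make that assumption explicit, something like this could be added
before queueing the discard (sketch only):

    /*
     * An L2 bitmap with allocation bits set but no host cluster offset
     * is invalid; get_sc_range_info() is expected to have rejected such
     * an entry as QCOW2_SUBCLUSTER_INVALID already.
     */
    if (scri.l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC) {
        assert((scri.l2_entry & L2E_OFFSET_MASK) != 0);
    }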
> As a side note, I guess discard_in_l2_slice() should also use
> qcow2_queue_discard() instead of bdrv_pdiscard() then.
>
Yes, looks like it. I'll make it a separate patch then.
>> + }
>> +
>> + ret = 0;
>> +out:
>> + qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
>> +
>> + return ret;
>> +}
>> +
>> int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
>> uint64_t bytes, enum qcow2_discard_type type,
>> bool full_discard)
>> @@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState
>> *bs, uint64_t offset,
>> BDRVQcow2State *s = bs->opaque;
>> uint64_t end_offset = offset + bytes;
>> uint64_t nb_clusters;
>> + unsigned head, tail;
>> int64_t cleared;
>> int ret;
>> /* Caller must pass aligned values, except at image end */
>> - assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
>> - assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
>> + assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
>> + assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
>> end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
>> - nb_clusters = size_to_clusters(s, bytes);
>> + head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
>> + offset += head;
>> +
>> + tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
>> + end_offset - MAX(offset, start_of_cluster(s, end_offset));
>> + end_offset -= tail;
>> s->cache_discards = true;
>> + if (head) {
>> + ret = discard_l2_subclusters(bs, offset - head,
>> + size_to_subclusters(s, head), type,
>> + full_discard, NULL);
>> + if (ret < 0) {
>> + goto fail;
>> + }
>> + }
>> +
>> /* Each L2 slice is handled by its own loop iteration */
>> + nb_clusters = size_to_clusters(s, end_offset - offset);
>> +
>> while (nb_clusters > 0) {
>
> I think the comment should stay attached to the `while`.
>
Agreed.
> Hanna
>
>> cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
>> full_discard);
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested
2023-11-03 15:19 ` Hanna Czenczek
@ 2023-11-10 13:17 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-10 13:17 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 11/3/23 17:19, Hanna Czenczek wrote:
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> When zeroizing subclusters within single cluster, detect usage of the
>> BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
>> operation, much like it's done with the cluster-based discards. That
>> way subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will
>> lead to actual unmap.
>
> Ever since the introduction of discard-no-unref, I wonder whether that
> is actually an advantage or not. I can’t think of a reason why it’d be
> advantageous to drop the allocation.
You mean omitting the UNallocation on the discard path?
>
> On the other hand, zero_in_l2_slice() does it indeed, so consistency is
> a reasonable argument.
>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> block/qcow2-cluster.c | 17 ++++++++++++++---
>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index cf40f2dc12..040251f2c3 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs,
>> uint64_t offset,
>> unsigned nb_subclusters, int flags)
>> {
>> BDRVQcow2State *s = bs->opaque;
>> - uint64_t new_l2_bitmap;
>> + uint64_t new_l2_bitmap, l2_bitmap_mask;
>> int ret, sc = offset_to_sc_index(s, offset);
>> SubClusterRangeInfo scri = { 0 };
>> @@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs,
>> uint64_t offset,
>> goto out;
>> }
>> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc +
>> nb_subclusters);
>
> “l2_bitmap_mask” already wasn’t a great name in patch 4 because it
> doesn’t say what mask it is, but in patch 4, there was at least only one
> relevant mask. Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the
> name should reflect what kind of mask it is.
>
Noted.
>> new_l2_bitmap = scri.l2_bitmap;
>> - new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc +
>> nb_subclusters);
>> - new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc +
>> nb_subclusters);
>> + new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
>> + new_l2_bitmap &= ~l2_bitmap_mask;
>> /*
>> * If there're no non-zero subclusters left, we might as well
>> zeroize
>> @@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs,
>> uint64_t offset,
>> 1, flags);
>> }
>> + /*
>> + * If the request allows discarding subclusters and they're actually
>> + * allocated, we go down the discard path since after the discard
>> + * operation the subclusters are going to be read as zeroes anyway.
>> + */
>> + if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap &
>> l2_bitmap_mask)) {
>> + return discard_l2_subclusters(bs, offset, nb_subclusters,
>> + QCOW2_DISCARD_REQUEST, false,
>> &scri);
>> + }
>
> I wonder why it matters whether any subclusters are allocated, i.e. why
> can’t we just immediately use discard_l2_subclusters() whenever
> BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also
> save us passing SCRI here, which I’ve already said on patch 4, I don’t
> find great).
>
Yes, this way we'd indeed be able to get rid of passing an additional
argument.
> Doesn’t discard_l2_subclusters() guarantee the clusters read as zero
> when full_discard=false? There is this case where it won’t mark the
> subclusters as zero if there is no backing file, and none of the
> subclusters is allocated. But still, (A) those subclusters will read as
> zero, and (B) if there is a problem there, why don’t we just always mark
> those subclusters as zero? This optimization only has effect when the
> guest discards/zeroes subclusters (not whole clusters) that were already
> discarded, so sounds miniscule.
>
> Hanna
>
Good question. I also can't think of any issues when
zeroizing/discarding already unallocated clusters.
Essentially you're saying that zeroizing subclusters is equivalent to
discard_l2_subclusters(full_discard=false). Then in
discard_l2_subclusters() implementation we may mark the subclusters as
zeroes regardless of their allocation status in case of
full_discard=false. But when dealing with whole clusters this logic
isn't quite applicable, because if the l2 entry isn't allocated at all
there's no point marking it as zero. Correct?
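For the subcluster path I'd then end up with roughly this bitmap update
(sketch only, following the full_discard semantics discussed above):

    new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
    if (full_discard) {
        /* Fall through to the backing file: clear the zero bits as well */
        new_l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
    } else {
        /* Must read as zero afterwards, regardless of allocation status */
        new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
    }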
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-31 16:33 ` Hanna Czenczek
@ 2023-11-10 13:26 ` Andrey Drobyshev
0 siblings, 0 replies; 28+ messages in thread
From: Andrey Drobyshev @ 2023-11-10 13:26 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block; +Cc: qemu-devel, kwolf, eblake, berto, den
On 10/31/23 18:33, Hanna Czenczek wrote:
> (Sorry, opened another reply window, forgot I already had one open...)
>
> On 20.10.23 23:56, Andrey Drobyshev wrote:
>> This commit makes the discard operation work on the subcluster level
>> rather than cluster level. It introduces discard_l2_subclusters()
>> function and makes use of it in qcow2 discard implementation, much like
>> it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also
>> changes the qcow2 driver pdiscard_alignment to subcluster_size. That
>> way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
>> operation and free host disk space.
>>
>> This feature will let us gain additional disk space on guest
>> TRIM/discard requests, especially when using large enough clusters
>> (1M, 2M) with subclusters enabled.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> block/qcow2-cluster.c | 100 ++++++++++++++++++++++++++++++++++++++++--
>> block/qcow2.c | 8 ++--
>> 2 files changed, 101 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 7c6fa5524c..cf40f2dc12 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs,
>> uint64_t offset, uint64_t nb_clusters,
>> return nb_clusters;
>> }
>> +static int coroutine_fn GRAPH_RDLOCK
>> +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
>> + uint64_t nb_subclusters,
>> + enum qcow2_discard_type type,
>> + bool full_discard,
>> + SubClusterRangeInfo *pscri)
>> +{
>> + BDRVQcow2State *s = bs->opaque;
>> + uint64_t new_l2_bitmap, l2_bitmap_mask;
>> + int ret, sc = offset_to_sc_index(s, offset);
>> + SubClusterRangeInfo scri = { 0 };
>> +
>> + if (!pscri) {
>> + ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
>> + if (ret < 0) {
>> + goto out;
>> + }
>> + } else {
>> + scri = *pscri;
>
> Allowing to takes this from the caller sounds dangerous, considering we
> need to track who takes care of freeing scri.l2_slice.
>
>> + }
>> +
>> + l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc +
>> nb_subclusters);
>> + new_l2_bitmap = scri.l2_bitmap;
>> + new_l2_bitmap &= ~l2_bitmap_mask;
>> +
>> + /*
>> + * If there're no allocated subclusters left, we might as well
>> discard
>> + * the entire cluster. That way we'd also update the refcount
>> table.
>> + */
>> + if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
>
> What if there are subclusters in the cluster that are marked as zero,
> outside of the discarded range? It sounds wrong to apply a discard with
> either full_discard set or cleared to them.
>
Hmmm, then I think before falling through to this path we should check
one of the following:
1. Make sure full_discard=false. That is directly implied by what you
said in your other mail: discarding with full_discard=false is
equivalent to zeroizing;
2. Check that no subclusters within this cluster are explicitly marked
as zeroes;
3. Check that either 1) or 2) holds (i.e. (1) || (2)).
For now I like option 3) better (rough sketch below).
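Option 3) would then look roughly like this (untested sketch):

    /*
     * Only extend the discard to the whole cluster if either this is not
     * a full discard, or no subcluster in this cluster is explicitly
     * marked as zero.
     */
    if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC) &&
        (!full_discard || !(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ZEROES))) {
        return discard_in_l2_slice(bs,
                                   QEMU_ALIGN_DOWN(offset, s->cluster_size),
                                   1, type, full_discard);
    }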
>> + return discard_in_l2_slice(bs,
>> + QEMU_ALIGN_DOWN(offset,
>> s->cluster_size),
>> + 1, type, full_discard);
>> + }
>> +
>> + /*
>> + * Full discard means we fall through to the backing file, thus
>> we only
>> + * need to mark the subclusters as deallocated.
>
> I think it also means we need to clear the zero bits.
>
Yes, that seems right.
> Hanna
>
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2023-10-27 11:10 ` Jean-Louis Dupond
@ 2024-04-16 19:56 ` Andrey Drobyshev
2024-04-19 9:06 ` Jean-Louis Dupond
0 siblings, 1 reply; 28+ messages in thread
From: Andrey Drobyshev @ 2024-04-16 19:56 UTC (permalink / raw)
To: Jean-Louis Dupond, qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, den
On 10/27/23 14:10, Jean-Louis Dupond wrote:
> [...]
>
> I've checked all the code paths, and as far as I see it nowhere breaks
> the discard_no_unref option.
> It's important that we don't introduce new code paths that can make
> holes in the qcow2 image when this option is enabled :)
>
> If you can confirm my conclusion, that would be great.
>
>
> Thanks
> Jean-Louis
>
Hi Jean-Louis,
I've finally gotten around to working on v2 of this series. However, I'm
failing to get a grasp on what this option is supposed to do and what we
are trying to avoid here.
Consider this simple example:
# cd build
# ./qemu-img create -f qcow2 unref.qcow2 192K
# ./qemu-img create -f qcow2 nounref.qcow2 192K
# ./qemu-io -c "write 0 192K" unref.qcow2
# ./qemu-io -c "write 0 192K" nounref.qcow2
#
# strace -fv -e fallocate ./qemu-io -c "discard 64K 64K" unref.qcow2
[pid 887710] fallocate(9, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
393216, 65536) = 0
discard 65536/65536 bytes at offset 65536
64 KiB, 1 ops; 00.00 sec (252.123 MiB/sec and 4033.9660 ops/sec)
#
# strace -fv -e fallocate ./qemu-io -c "reopen -o discard-no-unref=on"
-c "discard 64K 64K" nounref.qcow2
# [pid 887789] fallocate(9, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
393216, 65536) = 0
discard 65536/65536 bytes at offset 65536
64 KiB, 1 ops; 00.00 sec (345.457 MiB/sec and 5527.3049 ops/sec)
#
# ./qemu-img check unref.qcow2
No errors were found on the image.
2/3 = 66.67% allocated, 50.00% fragmented, 0.00% compressed clusters
Image end offset: 524288
# ./qemu-img check nounref.qcow2
No errors were found on the image.
3/3 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 524288
#
# ls -la *.qcow2
-rw-r--r-- 1 root root 524288 Apr 16 22:42 nounref.qcow2
-rw-r--r-- 1 root root 524288 Apr 16 22:41 unref.qcow2
# du --block-size=1 *.qcow2
397312 nounref.qcow2
397312 unref.qcow2
I understand that by keeping the L2 entry the cluster remains formally
allocated, but fallocate(FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE) is
called regardless of whether the "discard-no-unref" option is enabled,
leaving a hole in the file (i.e. the file becomes sparse). However, you
say in the comment above that we can't allow making new holes in the
file when this option is enabled. How do these two fit together, and
what do we achieve? And which logic do you think we need to follow when
discarding separate subclusters?
Andrey
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/7] qcow2: make subclusters discardable
2024-04-16 19:56 ` Andrey Drobyshev
@ 2024-04-19 9:06 ` Jean-Louis Dupond
0 siblings, 0 replies; 28+ messages in thread
From: Jean-Louis Dupond @ 2024-04-19 9:06 UTC (permalink / raw)
To: Andrey Drobyshev, qemu-block
Cc: qemu-devel, hreitz, kwolf, eblake, berto, den
On 16/04/2024 21:56, Andrey Drobyshev wrote:
> On 10/27/23 14:10, Jean-Louis Dupond wrote:
>> [...]
>>
>> I've checked all the code paths, and as far as I see it nowhere breaks
>> the discard_no_unref option.
>> It's important that we don't introduce new code paths that can make
>> holes in the qcow2 image when this option is enabled :)
>>
>> If you can confirm my conclusion, that would be great.
>>
>>
>> Thanks
>> Jean-Louis
>>
> Hi Jean-Louis,
>
> I've finally got to working on v2 for this series. However I'm failing
> to get a grasp on what this option is supposed to be doing and what are
> we trying to avoid here.
The discard-no-unref option causes qemu to only zero the blocks/clusters
that get discarded, but does NOT remove the reference of the cluster.
So the cluster stays allocated/referenced, but is just marked zero.
There are multiple scenarios where you would need this.
First of all, when you have a pre-allocated image, you most likely
created it because you don't want fragmentation.
But if you don't have discard-no-unref enabled, you will end up with a
fragmented image anyway, because discards will punch holes in your
image which later get reallocated randomly.
Another scenario (and why we implemented it) is that with a sparse
image, you allocate new blocks at the 'allocation pointer' (which
points to the first available blocks in your image).
But if you do discards, AFAIK the pointer is not moved back to the
freed clusters, so allocation keeps happening at the end until you
reopen the image.
And even then: say you created a hole of 5 free clusters and you need
to allocate 4 new clusters; it will use that hole and leave 1 empty
cluster. If the next allocation needs 2 clusters, it will jump to the
next free space with at least 2 clusters, leaving that 1 cluster
unallocated.
And this caused us to have 'sparse' images of 110GB for 100GB images,
for example, just because the qcow2 image was full of small empty
clusters and completely fragmented.
>
> Consider this simple example:
>
> # cd build
> # ./qemu-img create -f qcow2 unref.qcow2 192K
> # ./qemu-img create -f qcow2 nounref.qcow2 192K
> # ./qemu-io -c "write 0 192K" unref.qcow2
> # ./qemu-io -c "write 0 192K" nounref.qcow2
> #
> # strace -fv -e fallocate ./qemu-io -c "discard 64K 64K" unref.qcow2
> [pid 887710] fallocate(9, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
> 393216, 65536) = 0
> discard 65536/65536 bytes at offset 65536
> 64 KiB, 1 ops; 00.00 sec (252.123 MiB/sec and 4033.9660 ops/sec)
> #
> # strace -fv -e fallocate ./qemu-io -c "reopen -o discard-no-unref=on"
> -c "discard 64K 64K" nounref.qcow2
> # [pid 887789] fallocate(9, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
> 393216, 65536) = 0
> discard 65536/65536 bytes at offset 65536
> 64 KiB, 1 ops; 00.00 sec (345.457 MiB/sec and 5527.3049 ops/sec)
> #
> # ./qemu-img check unref.qcow2
>
> No errors were found on the image.
> 2/3 = 66.67% allocated, 50.00% fragmented, 0.00% compressed clusters
> Image end offset: 524288
> # ./qemu-img check nounref.qcow2
> No errors were found on the image.
> 3/3 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
> Image end offset: 524288
> #
> # ls -la *.qcow2
>
> -rw-r--r-- 1 root root 524288 Apr 16 22:42 nounref.qcow2
> -rw-r--r-- 1 root root 524288 Apr 16 22:41 unref.qcow2
> # du --block-size=1 *.qcow2
> 397312 nounref.qcow2
> 397312 unref.qcow2
>
> I understand that by keeping the L2 entry we achieve that cluster
> remains formally allocated, but no matter whether "discard-no-unref"
> option is enabled fallocate(FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE) is
> being called leaving a hole in the file (e.g. file becomes sparse).
> However you say in the comment above that we can't allow making new
> holes in the file when this option is enabled. How does that correlate
> and what do we achieve? And which logic do you think we need to follow
> when discarding separate subclusters?
>
> Andrey
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2024-04-19 9:07 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-20 21:56 [PATCH 0/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 1/7] qcow2: make function update_refcount_discard() global Andrey Drobyshev
2023-10-31 15:27 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges Andrey Drobyshev
2023-10-31 15:53 ` Hanna Czenczek
2023-11-09 12:32 ` Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters Andrey Drobyshev
2023-10-31 16:06 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 4/7] qcow2: make subclusters discardable Andrey Drobyshev
2023-10-27 11:10 ` Jean-Louis Dupond
2024-04-16 19:56 ` Andrey Drobyshev
2024-04-19 9:06 ` Jean-Louis Dupond
2023-10-31 16:32 ` Hanna Czenczek
2023-11-09 15:05 ` Andrey Drobyshev
2023-10-31 16:33 ` Hanna Czenczek
2023-11-10 13:26 ` Andrey Drobyshev
2023-11-03 15:53 ` Hanna Czenczek
2023-10-20 21:56 ` [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested Andrey Drobyshev
2023-11-03 15:19 ` Hanna Czenczek
2023-11-10 13:17 ` Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 6/7] iotests/common.rc: add disk_usage function Andrey Drobyshev
2023-11-03 15:20 ` Hanna Czenczek
2023-11-09 12:35 ` Andrey Drobyshev
2023-10-20 21:56 ` [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap Andrey Drobyshev
2023-11-03 15:51 ` Hanna Czenczek
2023-11-03 15:59 ` Hanna Czenczek
2023-11-09 14:05 ` Andrey Drobyshev
2023-11-09 13:55 ` Andrey Drobyshev