* [Qemu-devel] [PULL 1/4] qcow2: Inform block layer about discard boundaries
2016-11-22 16:04 [Qemu-devel] [PULL 0/4] Block layer patches for 2.8.0-rc1 Kevin Wolf
@ 2016-11-22 16:04 ` Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 2/4] block: Let write zeroes fallback work even with small max_transfer Kevin Wolf
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Kevin Wolf @ 2016-11-22 16:04 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, qemu-devel
From: Eric Blake <eblake@redhat.com>
At the qcow2 layer, discard is only possible on a per-cluster
basis; at the moment, qcow2 silently rounds any unaligned
requests to this granularity. However, an upcoming patch will
fix a regression in the block layer ignoring too much of an
unaligned discard request, by changing the block layer to
break up a discard request at alignment boundaries; for that
to work, the block layer must know about our limits.
However, we can't go one step further by changing
qcow2_discard_clusters() to assert that requests are always
aligned, since that helper function is reached on paths
outside of the block layer.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/qcow2.c b/block/qcow2.c
index 6d5689a..e22f6dc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1206,6 +1206,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
bs->bl.request_alignment = BDRV_SECTOR_SIZE;
}
bs->bl.pwrite_zeroes_alignment = s->cluster_size;
+ bs->bl.pdiscard_alignment = s->cluster_size;
}
static int qcow2_set_key(BlockDriverState *bs, const char *key)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [Qemu-devel] [PULL 2/4] block: Let write zeroes fallback work even with small max_transfer
2016-11-22 16:04 [Qemu-devel] [PULL 0/4] Block layer patches for 2.8.0-rc1 Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 1/4] qcow2: Inform block layer about discard boundaries Kevin Wolf
@ 2016-11-22 16:04 ` Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 3/4] block: Return -ENOTSUP rather than assert on unaligned discards Kevin Wolf
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Kevin Wolf @ 2016-11-22 16:04 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, qemu-devel
From: Eric Blake <eblake@redhat.com>
Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary. But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.
However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k. Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.
Test case:
$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056
In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea. But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.
Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/io.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
bs->bl.request_alignment);
+ int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+ MAX_WRITE_ZEROES_BOUNCE_BUFFER);
assert(alignment % bs->bl.request_alignment == 0);
head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
* boundaries.
*/
if (head) {
- /* Make a small request up to the first aligned sector. */
- num = MIN(count, alignment - head);
- head = 0;
+ /* Make a small request up to the first aligned sector. For
+ * convenience, limit this request to max_transfer even if
+ * we don't need to fall back to writes. */
+ num = MIN(MIN(count, max_transfer), alignment - head);
+ head = (head + num) % alignment;
+ assert(num < max_write_zeroes);
} else if (tail && num > alignment) {
/* Shorten the request to the last aligned sector. */
num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
if (ret == -ENOTSUP) {
/* Fall back to bounce buffer if write zeroes is unsupported */
- int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
- MAX_WRITE_ZEROES_BOUNCE_BUFFER);
BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
if ((flags & BDRV_REQ_FUA) &&
--
1.8.3.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [Qemu-devel] [PULL 3/4] block: Return -ENOTSUP rather than assert on unaligned discards
2016-11-22 16:04 [Qemu-devel] [PULL 0/4] Block layer patches for 2.8.0-rc1 Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 1/4] qcow2: Inform block layer about discard boundaries Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 2/4] block: Let write zeroes fallback work even with small max_transfer Kevin Wolf
@ 2016-11-22 16:04 ` Kevin Wolf
2016-11-22 16:04 ` [Qemu-devel] [PULL 4/4] block: Pass unaligned discard requests to drivers Kevin Wolf
2016-11-22 19:30 ` [Qemu-devel] [Qemu-block] [PULL 0/4] Block layer patches for 2.8.0-rc1 Stefan Hajnoczi
4 siblings, 0 replies; 6+ messages in thread
From: Kevin Wolf @ 2016-11-22 16:04 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, qemu-devel
From: Eric Blake <eblake@redhat.com>
Right now, the block layer rounds discard requests, so that
individual drivers are able to assert that discard requests
will never be unaligned. But there are some ISCSI devices
that track and coalesce multiple unaligned requests, turning it
into an actual discard if the requests eventually cover an
entire page, which implies that it is better to always pass
discard requests as low down the stack as possible.
In isolation, this patch has no semantic effect, since the
block layer currently never passes an unaligned request through.
But the block layer already has code that silently ignores
drivers that return -ENOTSUP for a discard request that cannot
be honored (as well as drivers that return 0 even when nothing
was done). But the next patch will update the block layer to
fragment discard requests, so that clients are guaranteed that
they are either dealing with an unaligned head or tail, or an
aligned core, making it similar to the block layer semantics of
write zero fragmentation.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/iscsi.c | 4 +++-
block/qcow2.c | 5 +++++
block/sheepdog.c | 5 +++--
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 71bd523..0960929 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1083,7 +1083,9 @@ coroutine_fn iscsi_co_pdiscard(BlockDriverState *bs, int64_t offset, int count)
struct IscsiTask iTask;
struct unmap_list list;
- assert(is_byte_request_lun_aligned(offset, count, iscsilun));
+ if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
+ return -ENOTSUP;
+ }
if (!iscsilun->lbp.lbpu) {
/* UNMAP is not supported by the target */
diff --git a/block/qcow2.c b/block/qcow2.c
index e22f6dc..7cfcd84 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2491,6 +2491,11 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
int ret;
BDRVQcow2State *s = bs->opaque;
+ if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
+ assert(count < s->cluster_size);
+ return -ENOTSUP;
+ }
+
qemu_co_mutex_lock(&s->lock);
ret = qcow2_discard_clusters(bs, offset, count >> BDRV_SECTOR_BITS,
QCOW2_DISCARD_REQUEST, false);
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 1fb9173..4c9af89 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2829,8 +2829,9 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset,
iov.iov_len = sizeof(zero);
discard_iov.iov = &iov;
discard_iov.niov = 1;
- assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
- assert((count & (BDRV_SECTOR_SIZE - 1)) == 0);
+ if (!QEMU_IS_ALIGNED(offset | count, BDRV_SECTOR_SIZE)) {
+ return -ENOTSUP;
+ }
acb = sd_aio_setup(bs, &discard_iov, offset >> BDRV_SECTOR_BITS,
count >> BDRV_SECTOR_BITS);
acb->aiocb_type = AIOCB_DISCARD_OBJ;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [Qemu-devel] [PULL 4/4] block: Pass unaligned discard requests to drivers
2016-11-22 16:04 [Qemu-devel] [PULL 0/4] Block layer patches for 2.8.0-rc1 Kevin Wolf
` (2 preceding siblings ...)
2016-11-22 16:04 ` [Qemu-devel] [PULL 3/4] block: Return -ENOTSUP rather than assert on unaligned discards Kevin Wolf
@ 2016-11-22 16:04 ` Kevin Wolf
2016-11-22 19:30 ` [Qemu-devel] [Qemu-block] [PULL 0/4] Block layer patches for 2.8.0-rc1 Stefan Hajnoczi
4 siblings, 0 replies; 6+ messages in thread
From: Kevin Wolf @ 2016-11-22 16:04 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, qemu-devel
From: Eric Blake <eblake@redhat.com>
Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong from the data that
the guest sees. But at least the Dell Equallogic iSCSI SANs
has an interesting property that its advertised discard
alignment is 15M, yet documents that discarding a sequence
of 1M slices will eventually result in the 15M page being
marked as discarded, and it is possible to observe which
pages have been discarded.
Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware. This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.
Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding it to alignment boundaries early.
Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary. The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.
Reported by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)
diff --git a/block/io.c b/block/io.c
index 085ac34..4f00562 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
{
BdrvTrackedRequest req;
int max_pdiscard, ret;
- int head, align;
+ int head, tail, align;
if (!bs->drv) {
return -ENOMEDIUM;
@@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
return 0;
}
- /* Discard is advisory, so ignore any unaligned head or tail */
+ /* Discard is advisory, but some devices track and coalesce
+ * unaligned requests, so we must pass everything down rather than
+ * round here. Still, most devices will just silently ignore
+ * unaligned requests (by returning -ENOTSUP), so we must fragment
+ * the request accordingly. */
align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
assert(align % bs->bl.request_alignment == 0);
head = offset % align;
- if (head) {
- head = MIN(count, align - head);
- count -= head;
- offset += head;
- }
- count = QEMU_ALIGN_DOWN(count, align);
- if (!count) {
- return 0;
- }
+ tail = (offset + count) % align;
bdrv_inc_in_flight(bs);
tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
@@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
align);
- assert(max_pdiscard);
+ assert(max_pdiscard >= bs->bl.request_alignment);
while (count > 0) {
int ret;
- int num = MIN(count, max_pdiscard);
+ int num = count;
+
+ if (head) {
+ /* Make small requests to get to alignment boundaries. */
+ num = MIN(count, align - head);
+ if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
+ num %= bs->bl.request_alignment;
+ }
+ head = (head + num) % align;
+ assert(num < max_pdiscard);
+ } else if (tail) {
+ if (num > align) {
+ /* Shorten the request to the last aligned cluster. */
+ num -= tail;
+ } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
+ tail > bs->bl.request_alignment) {
+ tail %= bs->bl.request_alignment;
+ num -= tail;
+ }
+ }
+ /* limit request size */
+ if (num > max_pdiscard) {
+ num = max_pdiscard;
+ }
if (bs->drv->bdrv_co_pdiscard) {
ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PULL 0/4] Block layer patches for 2.8.0-rc1
2016-11-22 16:04 [Qemu-devel] [PULL 0/4] Block layer patches for 2.8.0-rc1 Kevin Wolf
` (3 preceding siblings ...)
2016-11-22 16:04 ` [Qemu-devel] [PULL 4/4] block: Pass unaligned discard requests to drivers Kevin Wolf
@ 2016-11-22 19:30 ` Stefan Hajnoczi
4 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2016-11-22 19:30 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, qemu-devel, stefanha
[-- Attachment #1: Type: text/plain, Size: 1300 bytes --]
On Tue, Nov 22, 2016 at 05:04:49PM +0100, Kevin Wolf wrote:
> The following changes since commit a7764f1548ef9946af30a8f96be9cef10761f0c1:
>
> Fix FreeBSD (10.x) build after 7dc9ae43 (2016-11-22 10:56:01 +0000)
>
> are available in the git repository at:
>
> git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to 3482b9bc411a9a12b2efde1018e1ddc906cd817e:
>
> block: Pass unaligned discard requests to drivers (2016-11-22 15:59:23 +0100)
>
> ----------------------------------------------------------------
> Block layer patches for 2.8.0-rc1
>
> ----------------------------------------------------------------
> Eric Blake (4):
> qcow2: Inform block layer about discard boundaries
> block: Let write zeroes fallback work even with small max_transfer
> block: Return -ENOTSUP rather than assert on unaligned discards
> block: Pass unaligned discard requests to drivers
>
> block/io.c | 58 ++++++++++++++++++++++++++++++++++++++------------------
> block/iscsi.c | 4 +++-
> block/qcow2.c | 6 ++++++
> block/sheepdog.c | 5 +++--
> 4 files changed, 52 insertions(+), 21 deletions(-)
>
Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread