Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	pbonzini@redhat.com, qemu-stable@nongnu.org,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Fam Zheng <famz@redhat.com>, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
Date: Tue, 22 Nov 2016 15:03:43 +0100	[thread overview]
Message-ID: <20161122140343.GF5615@noname.redhat.com> (raw)
In-Reply-To: <1479413642-22463-6-git-send-email-eblake@redhat.com>

Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong from the data that
> the guest sees.  But at least the Dell Equallogic iSCSI SANs
> has an interesting property that its advertised discard
> alignment is 15M, yet documents that discarding a sequence
> of 1M slices will eventually result in the 15M page being
> marked as discarded, and it is possible to observe which
> pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding it to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported by: Peter Lieven <pl@kamp.de>
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment
> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }
> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;
> +            }
> +        }

Hm, from the commit message I expected splitting requests in three
(head, bulk, tail), but actually we can end up splitting it in five?

Just to check whether I got this right, let me try an example: Let's
assume request alignment 512, pdiscard alignment 64k, and we get a
discard request with offset 510, length 130k. This algorithm makes the
following calls to the driver:

1. pdiscard offset=510, len=2               | new count = 130k - 2
2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
3. pdiscard offset=64k, len=64k             | new count = 2558
4. pdiscard offset=128k, len=2048           | new count = 510
5. pdiscard offset=130k, len=510            | new count = 0

Correct?

If so, is this really necessary or even helpful? I see that the iscsi
driver throws away requests 1 and 5 and needs the split because
otherwise it would disregard the areas covered by 2 and 4, too. But why
can't or shouldn't the iscsi driver do this rounding itself when it gets
combined 1+2 and 4+5 requests?

Kevin

next prev parent reply	other threads:[~2016-11-22 14:04 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-17 20:13 [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 1/9] nbd: Allow unmap and fua during write zeroes Eric Blake
2016-11-17 21:10   ` Max Reitz
2016-11-17 21:14     ` Eric Blake
2016-11-18 14:12     ` Paolo Bonzini
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 2/9] qcow2: Inform block layer about discard boundaries Eric Blake
2016-11-17 21:24   ` Max Reitz
2016-11-17 20:13 ` [Qemu-devel] [PATCH v3 3/9] block: Let write zeroes fallback work even with small max_transfer Eric Blake
2016-11-17 21:40   ` Max Reitz
2016-11-22 13:16   ` Kevin Wolf
2016-11-22 13:22     ` Eric Blake
2016-11-22 13:30       ` Kevin Wolf
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 4/9] block: Return -ENOTSUP rather than assert on unaligned discards Eric Blake
2016-11-17 22:01   ` Max Reitz
2016-11-18 22:48     ` Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers Eric Blake
2016-11-17 22:26   ` Max Reitz
2016-11-17 23:01     ` Eric Blake
2016-11-17 23:03       ` Max Reitz
2016-11-17 23:44   ` Max Reitz
2016-11-18  1:13     ` Eric Blake
2016-11-19 22:05       ` Max Reitz
2016-11-21 13:39         ` Peter Lieven
2016-11-22 14:03   ` Kevin Wolf [this message]
2016-11-22 14:13     ` Eric Blake
2016-11-22 14:56       ` Eric Blake
2016-11-17 20:13 ` [Qemu-devel] [PATCH v2 6/9] blkdebug: Sanity check block layer guarantees Eric Blake
2016-11-17 22:36   ` Max Reitz
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 7/9] blkdebug: Add pass-through write_zero and discard support Eric Blake
2016-11-17 22:47   ` Max Reitz
2016-11-18 23:08     ` Eric Blake
2016-11-17 23:27   ` Max Reitz
2016-11-18  1:17     ` Eric Blake
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 8/9] blkdebug: Add ability to override unmap geometries Eric Blake
2016-11-17 23:02   ` Max Reitz
2016-11-21 21:11     ` Eric Blake
2016-11-21 21:29       ` Eric Blake
2016-11-17 20:14 ` [Qemu-devel] [PATCH v2 9/9] tests: Add coverage for recent block geometry fixes Eric Blake
2016-11-17 23:19   ` Max Reitz
2016-11-18  1:19     ` Eric Blake
2016-11-17 23:42   ` Max Reitz
2016-11-18  1:28     ` Eric Blake
2016-11-19 21:45       ` Max Reitz
2016-11-19 22:17         ` Max Reitz
2016-11-21 11:38           ` Kevin Wolf
2016-11-21 16:16             ` Eric Blake
2016-11-22 16:05 ` [Qemu-devel] [PATCH v2 for-2.8* 0/9] Fix block regressions, add blkdebug tests Kevin Wolf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161122140343.GF5615@noname.redhat.com \
    --to=kwolf@redhat.com \
    --cc=eblake@redhat.com \
    --cc=famz@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.