qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: fam@euphon.net, kwolf@redhat.com, qemu-devel@nongnu.org,
	armbru@redhat.com, stefanha@redhat.com, den@openvz.org
Subject: Re: [PATCH v6 08/15] block: introduce preallocate filter
Date: Thu, 24 Sep 2020 18:50:38 +0200	[thread overview]
Message-ID: <73b1b0eb-5ca3-a436-5783-695d62fe337e@redhat.com> (raw)
In-Reply-To: <20200918181951.21752-9-vsementsov@virtuozzo.com>


[-- Attachment #1.1: Type: text/plain, Size: 11087 bytes --]

On 18.09.20 20:19, Vladimir Sementsov-Ogievskiy wrote:
> It's intended to be inserted between format and protocol nodes to
> preallocate additional space (expanding protocol file) on writes
> crossing EOF. It improves performance for file-systems with slow
> allocation.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  docs/system/qemu-block-drivers.rst.inc |  26 ++
>  qapi/block-core.json                   |  20 +-
>  block/preallocate.c                    | 556 +++++++++++++++++++++++++
>  block/meson.build                      |   1 +
>  4 files changed, 602 insertions(+), 1 deletion(-)
>  create mode 100644 block/preallocate.c

Looks good to me in general.

[...]

> diff --git a/block/preallocate.c b/block/preallocate.c
> new file mode 100644
> index 0000000000..6510ad0149
> --- /dev/null
> +++ b/block/preallocate.c

[...]

> +/*
> + * Handle reopen.
> + *
> + * We must implement reopen handlers, otherwise reopen just don't work. Handle
> + * new options and don't care about preallocation state, as it is handled in
> + * set/check permission handlers.
> + */
> +
> +static int preallocate_reopen_prepare(BDRVReopenState *reopen_state,
> +                                      BlockReopenQueue *queue, Error **errp)
> +{
> +    PreallocateOpts *opts = g_new0(PreallocateOpts, 1);
> +
> +    if (!preallocate_absorb_opts(opts, reopen_state->options,
> +                                 reopen_state->bs->file->bs, errp)) {

I tried to find out whether this refers to the old file child, or the
post-reopen one.  As far as I could find out, there is no generic
implementation for changing the file child as part of x-blockdev-reopen:

{"error": {"class": "GenericError", "desc": "Cannot change the option
'file'"}}

Now that’s a shame.  That means you can’t reasonably integrate a
preallocate filter into an existing node graph unless the format driver
checks for the respective child option and issues a replace_node on
commit or something, right?  I suppose any driver who’d want to
implement child replacement would need to attach the new node in
prepare() as some pseudo-child, and then drop the old one and rename the
new one in commit().  I don’t think any driver does that yet, so I
suppose no format driver allows replacement of children yet (except for
quorum...).

Does anyone know what the status on that is?  Are there any plans for
implementing child replacement in reopen, or did I just miss something?

(It looks like the backing child can be replaced, but that’s probably
not a child where the preallocate filter would be placed on top...).

> +        g_free(opts);
> +        return -EINVAL;
> +    }
> +
> +    reopen_state->opaque = opts;
> +
> +    return 0;
> +}

[...]

> +/*
> + * Call on each write. Returns true if @want_merge_zero is true and the region
> + * [offset, offset + bytes) is zeroed (as a result of this call or earlier
> + * preallocation).
> + *
> + * want_merge_zero is used to merge write-zero request with preallocation in
> + * one bdrv_co_pwrite_zeroes() call.
> + */
> +static bool coroutine_fn handle_write(BlockDriverState *bs, int64_t offset,
> +                                      int64_t bytes, bool want_merge_zero)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +    int64_t end = offset + bytes;
> +    int64_t prealloc_start, prealloc_end;
> +    int ret;
> +
> +    if (!has_prealloc_perms(bs)) {

Took me a bit to figure out, because it takes a trip to
preallocate_child_perm() to figure out exactly when we’re going to have
the necessary permissions for this to pass.

Then it turns out that this is going to be the case exactly when the
parents collectively have the same permissions (WRITE+RESIZE) on the
preallocate node.

Then I had to wonder whether it’s appropriate not to preallocate if
WRITE is taken, but RESIZE isn’t.  But that seems entirely correct, yes.
 If noone is going to grow the file, then there is no need for
preallocation.  (Vice versa, if noone is going to write, but only
resize, then there is no need for preallocation either.)

So this seems correct, yes.

(It could be argued that if one parent has WRITE, and another has RESIZE
(but neither has both), then we probably don’t need preallocation
either.  But in such an arcane case (which is impossible to figure out
in .bdrv_child_perm() anyway), we might as well just do preallocation.
Won’t hurt.)

> +        /* We don't have state neither should try to recover it */
> +        return false;
> +    }
> +
> +    if (s->data_end < 0) {
> +        s->data_end = bdrv_getlength(bs->file->bs);
> +        if (s->data_end < 0) {
> +            return false;
> +        }
> +
> +        if (s->file_end < 0) {
> +            s->file_end = s->data_end;
> +        }
> +    }
> +
> +    if (end <= s->data_end) {
> +        return false;
> +    }
> +
> +    /* We have valid s->data_end, and request writes beyond it. */
> +
> +    s->data_end = end;
> +    if (s->zero_start < 0 || !want_merge_zero) {
> +        s->zero_start = end;

Skipping this on want_merge_zero == true means that zero writes can be
cached; if you repeatedly perform zero writes into the preallocated
area, then none of those will actually be executed.  I legitimately
don’t know whether that’s OK.

I suppose until someone tells me it isn’t OK, I’ll believe it is.

> +    }
> +
> +    if (s->file_end < 0) {
> +        s->file_end = bdrv_getlength(bs->file->bs);
> +        if (s->file_end < 0) {
> +            return false;
> +        }
> +    }
> +
> +    /* Now s->data_end, s->zero_start and s->file_end are valid. */
> +
> +    if (end <= s->file_end) {
> +        /* No preallocation needed. */
> +        return want_merge_zero && offset >= s->zero_start;
> +    }
> +
> +    /* Now we want new preallocation, as request writes beyond s->data_end. */

s/data_end/file_end/

> +
> +    prealloc_start = want_merge_zero ? MIN(offset, s->file_end) : s->file_end;

I suppose you intentionally use s->file_end here instead of @end, even
if offset <= s->file_end.  I just mention it because I wonder whether
it’s really better to effectively write twice to the same area in such
cases (once zeroes for preallocation, then immediately the data) instead
of only writing the data and then preallocating past it.

(Though if it were the same code just with @end instead of s->file_end
for offset <= s->file_end, then the order would be to preallocate past
@end, and then to write the data.  Which might be suboptimal in terms of
how the blocks are then ordered in the filesystem.)

> +    prealloc_end = QEMU_ALIGN_UP(offset + bytes + s->opts.prealloc_size,

s/offset + bytes/end/?

> +                                 s->opts.prealloc_align);
> +    s->file_end = end;

Why is this set here, when it’s always overwritten after
bdrv_co_pwrite_zeroes() anyway?

(It also seems a bit wrong, because at this point we don’t know yet
whether the data write is going to succeed, so we don’t know for sure
whether the file end is really going to be @end without the preallocation.)

> +
> +    ret = bdrv_co_pwrite_zeroes(
> +            bs->file, prealloc_start, prealloc_end - prealloc_start,
> +            BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> +    if (ret < 0) {
> +        s->file_end = ret;
> +        return false;
> +    }
> +
> +    s->file_end = prealloc_end;
> +    return want_merge_zero;
> +}
> +
> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    bool want_merge_zero =
> +        (flags & (BDRV_REQ_ZERO_WRITE | BDRV_REQ_NO_FALLBACK)) == flags;

Isn’t this the same as !(flags & ~(ZERO_WRITE | NO_FALLBACK))?  (Maybe
only I would find that simpler to understand, though.)

> +    if (handle_write(bs, offset, bytes, want_merge_zero)) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
> +}

[...]

> +static int coroutine_fn
> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
> +                        bool exact, PreallocMode prealloc,
> +                        BdrvRequestFlags flags, Error **errp)
> +{
> +    ERRP_GUARD();
> +    BDRVPreallocateState *s = bs->opaque;
> +    int ret;
> +
> +    if (s->data_end >= 0 && offset > s->data_end) {
> +        if (s->file_end < 0) {
> +            s->file_end = bdrv_getlength(bs->file->bs);
> +            if (s->file_end < 0) {
> +                error_setg(errp, "failed to get file length");
> +                return s->file_end;
> +            }
> +        }
> +
> +        if (prealloc == PREALLOC_MODE_FALLOC) {
> +            /*
> +             * If offset <= s->file_end, the task is already done, just
> +             * update s->file_end, to move part of "filter preallocation"

s/file_end/data_end/

> +             * to "preallocation requested by user".
> +             * Otherwise just proceed to preallocate missing part.
> +             */
> +            if (offset <= s->file_end) {
> +                s->data_end = offset;
> +                return 0;
> +            }

[...]

> +static int preallocate_check_perm(BlockDriverState *bs,
> +                                  uint64_t perm, uint64_t shared, Error **errp)
> +{
> +    BDRVPreallocateState *s = bs->opaque;
> +
> +    if (s->data_end >= 0 && !can_write_resize(perm)) {
> +        /*
> +         * Loose permissions.

*Lose

(I assume)

> +         * We should truncate in check_perm, as in set_perm bs->file->perm will
> +         * be already changed, and we should not violate it.
> +         */
> +        if (s->file_end < 0) {
> +            s->file_end = bdrv_getlength(bs->file->bs);
> +            if (s->file_end < 0) {
> +                error_setg(errp, "Failed to get file length");
> +                return s->file_end;
> +            }
> +        }
> +
> +        if (s->data_end < s->file_end) {
> +            int ret = bdrv_truncate(bs->file, s->data_end, true,
> +                                    PREALLOC_MODE_OFF, 0, NULL);
> +            if (ret < 0) {
> +                error_setg(errp, "Failed to drop preallocation");
> +                s->file_end = ret;
> +                return ret;
> +            }
> +        }
> +        /*
> +         * We will drop our permissions, as well as allow shared
> +         * permissions, anyone will be able to change the child, so mark
> +         * all states invalid. We'll regain control if get good permissions
> +         * back.
> +         */
> +        s->data_end = s->file_end = s->zero_start = -EINVAL;

Shouldn’t we clear these variables whenever !can_write_resize(perm), not
just if also s->data_end >= 0?

> +    }
> +
> +    return 0;
> +}

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2020-09-24 17:00 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-18 18:19 [PATCH v6 00/15] preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 01/15] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
2020-09-28 15:34   ` Alberto Garcia
2020-09-18 18:19 ` [PATCH v6 02/15] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 03/15] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 04/15] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 05/15] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 06/15] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 07/15] block: bdrv_check_perm(): process children anyway Vladimir Sementsov-Ogievskiy
2020-09-24 14:25   ` Max Reitz
2020-09-24 14:55     ` Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 08/15] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-24 16:50   ` Max Reitz [this message]
2020-09-24 17:36     ` Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 09/15] qemu-io: add preallocate mode parameter for truncate command Vladimir Sementsov-Ogievskiy
2020-09-24 17:08   ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 10/15] iotests: qemu_io_silent: support --image-opts Vladimir Sementsov-Ogievskiy
2020-09-25  7:10   ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 11/15] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
2020-09-25  8:26   ` Max Reitz
2020-09-25  8:49     ` Vladimir Sementsov-Ogievskiy
2020-09-25  9:11       ` Max Reitz
2020-09-25 15:11         ` Vladimir Sementsov-Ogievskiy
2020-09-25 15:32           ` Vladimir Sementsov-Ogievskiy
2020-10-01  7:40             ` Max Reitz
2020-10-01  8:41           ` Thomas Huth
2020-09-18 18:19 ` [PATCH v6 12/15] scripts/simplebench: support iops Vladimir Sementsov-Ogievskiy
2020-09-25  8:54   ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 13/15] scripts/simplebench: improve view of ascii table Vladimir Sementsov-Ogievskiy
2020-09-25  9:31   ` Max Reitz
2020-09-25 16:58     ` Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 14/15] scripts/simplebench: improve ascii table: add difference line Vladimir Sementsov-Ogievskiy
2020-09-25 10:24   ` Max Reitz
2020-09-25 17:13     ` Vladimir Sementsov-Ogievskiy
2020-10-01  8:22       ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 15/15] scripts/simplebench: add bench_prealloc.py Vladimir Sementsov-Ogievskiy
2020-09-25 11:25   ` Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=73b1b0eb-5ca3-a436-5783-695d62fe337e@redhat.com \
    --to=mreitz@redhat.com \
    --cc=armbru@redhat.com \
    --cc=den@openvz.org \
    --cc=fam@euphon.net \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).