qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org
Cc: fam@euphon.net, stefanha@redhat.com, jcody@redhat.com,
	kwolf@redhat.com, den@openvz.org, eblake@redhat.com,
	jsnow@redhat.com
Subject: Re: [Qemu-devel] [PATCH v5 07/11] block: introduce backup-top filter driver
Date: Wed, 16 Jan 2019 17:02:07 +0100	[thread overview]
Message-ID: <5a822e18-6967-5059-bf21-6891aa701af4@redhat.com> (raw)
In-Reply-To: <20181229122027.42245-8-vsementsov@virtuozzo.com>

[-- Attachment #1: Type: text/plain, Size: 15568 bytes --]

On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
> Backup-top filter does copy-before-write operation. It should be
> inserted above active disk and has a target node for CBW, like the
> following:
> 
>     +-------+
>     | Guest |
>     +---+---+
>         |r,w
>         v
>     +---+-----------+  target   +---------------+
>     | backup_top    |---------->| target(qcow2) |
>     +---+-----------+   CBW     +---+-----------+
>         |
> backing |r,w
>         v
>     +---+---------+
>     | Active disk |
>     +-------------+
> 
> The driver will be used in backup instead of write-notifiers.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup-top.h  |  43 +++++++
>  block/backup-top.c  | 306 ++++++++++++++++++++++++++++++++++++++++++++
>  block/Makefile.objs |   2 +
>  3 files changed, 351 insertions(+)
>  create mode 100644 block/backup-top.h
>  create mode 100644 block/backup-top.c

Looks good to me overall, comments below:

> diff --git a/block/backup-top.h b/block/backup-top.h
> new file mode 100644
> index 0000000000..abe4dda574
> --- /dev/null
> +++ b/block/backup-top.h
> @@ -0,0 +1,43 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +

Include guards are missing here.

> +#include "qemu/osdep.h"
> +
> +#include "block/block_int.h"
> +
> +typedef struct BDRVBackupTopState {
> +    HBitmap *copy_bitmap; /* what should be copied to @target on guest write. */
> +    BdrvChild *target;
> +
> +    uint64_t bytes_copied;
> +} BDRVBackupTopState;
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs);
> +uint64_t bdrv_backup_top_progress(BlockDriverState *bs);
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp);

A comment describing at least parts of the interface would be nice; for
instance, that @target will be forced into the same AioContext as
@source, or that the backup-top node will be an implicit node.

> diff --git a/block/backup-top.c b/block/backup-top.c
> new file mode 100644
> index 0000000000..0e7b3e3142
> --- /dev/null
> +++ b/block/backup-top.c
> @@ -0,0 +1,306 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/cutils.h"
> +#include "qapi/error.h"
> +#include "block/block_int.h"
> +#include "block/qdict.h"
> +
> +#include "block/backup-top.h"
> +
> +static coroutine_fn int backup_top_co_preadv(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, int flags)
> +{
> +    /*
> +     * Features to be implemented:
> +     * F1. COR. save read data to fleecing target for fast access
> +     *     (to reduce reads). This possibly may be done with use of copy-on-read
> +     *     filter, but we need an ability to make COR requests optional: for
> +     *     example, if target is a ram-cache, and if it is full now, we should
> +     *     skip doing COR request, as it is actually not necessary.
> +     *
> +     * F2. Feature for guest: read from fleecing target if data is in ram-cache
> +     *     and is unchanged
> +     */
> +
> +    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
> +                                          uint64_t bytes)

I don't know if there is a real guideline, but usually parameters are
aligned at the opening parenthesis.

> +{
> +    int ret = 0;
> +    BDRVBackupTopState *s = bs->opaque;
> +    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
> +    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
> +    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
> +    size_t align = MAX(bdrv_opt_mem_align(bs->backing->bs),
> +                       bdrv_opt_mem_align(s->target->bs));
> +    struct iovec iov = {
> +        .iov_base = qemu_memalign(align, end - off),

I think it would be more correct to use qemu_blockalign() here and set
bs->bl.opt_mem_alignment somewhere.

> +        .iov_len = end - off

How about just leaving this out here... [1]

> +    };
> +    QEMUIOVector qiov;
> +
> +    qemu_iovec_init_external(&qiov, &iov, 1);

[1] ...and move this... [1]

> +
> +    /*
> +     * Features to be implemented:
> +     * F3. parallelize copying loop
> +     * F4. detect zeros

Isn't there detect-zeroes for that?

> +     * F5. use block_status ?
> +     * F6. don't copy clusters which are already cached by COR [see F1]
> +     * F7. if target is ram-cache and it is full, there should be a possibility
> +     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
> +     *     fast.

Another idea: Read all dirty data from bs->backing (in parallel, more or
less), and then perform all writes (to bs->backing and s->target) in
parallel as well.

(So the writes do not have to wait on one another)

> +     */
> +
> +    len = end - off;
> +    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
> +        iov.iov_len = qiov.size = len;

[1] ...here, and not setting qiov.size?

> +
> +        hbitmap_reset(s->copy_bitmap, off, len);
> +
> +        ret = bdrv_co_preadv(bs->backing, off, len, &qiov,
> +                             BDRV_REQ_NO_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto finish;
> +        }
> +
> +        ret = bdrv_co_pwritev(s->target, off, len, &qiov, BDRV_REQ_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto finish;
> +        }
> +
> +        s->bytes_copied += len;
> +        off += len;
> +        if (off >= end) {
> +            break;
> +        }
> +        len = end - off;
> +    }
> +
> +finish:
> +    qemu_vfree(iov.iov_base);
> +
> +    /*
> +     * F8. we fail guest request in case of error. We can alter it by
> +     * possibility to fail copying process instead, or retry several times, or
> +     * may be guest pause, etc.
> +     */
> +    return ret;
> +}
> +
> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
> +                                                  int64_t offset, int bytes)

(Indentation)

> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /*
> +     * Features to be implemented:
> +     * F9. possibility of lazy discard: just defer the discard after fleecing
> +     *     completion. If write (or new discard) occurs to the same area, just
> +     *     drop deferred discard.
> +     */
> +
> +    return bdrv_co_pdiscard(bs->backing, offset, bytes);
> +}
> +
> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
> +                                                 uint64_t offset,
> +                                                 uint64_t bytes,
> +                                                 QEMUIOVector *qiov, int flags)

(Indentation)

> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
> +{
> +    if (!bs->backing) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_flush(bs->backing->bs);

Why not the target as well?

> +}
> +
> +static void backup_top_refresh_filename(BlockDriverState *bs, QDict *opts)
> +{
> +    if (bs->backing == NULL) {
> +        /*
> +         * we can be here after failed bdrv_attach_child in
> +         * bdrv_set_backing_hd
> +         */
> +        return;
> +    }
> +    bdrv_refresh_filename(bs->backing->bs);
> +    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
> +            bs->backing->bs->filename);
> +}
> +
> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                       const BdrvChildRole *role,
> +                                       BlockReopenQueue *reopen_queue,
> +                                       uint64_t perm, uint64_t shared,
> +                                       uint64_t *nperm, uint64_t *nshared)

(Indentation)

> +{
> +    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared, nperm,
> +                              nshared);
> +
> +    if (role == &child_file) {
> +        /*
> +         * share write to target, to not interfere guest writes to it's disk

*interfere with guest writes to its disk

> +         * which will be in target backing chain
> +         */
> +        *nshared = *nshared | BLK_PERM_WRITE;
> +        *nperm = *nperm | BLK_PERM_WRITE;

Why do you need to take the write permission?  This filter does not
perform any writes unless it receives writes, right?  (So
bdrv_filter_default_perms() should take care of this.)

> +    } else {
> +        *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> +    }
> +}
> +
> +BlockDriver bdrv_backup_top_filter = {
> +    .format_name = "backup-top",
> +    .instance_size = sizeof(BDRVBackupTopState),
> +
> +    .bdrv_co_preadv             = backup_top_co_preadv,
> +    .bdrv_co_pwritev            = backup_top_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
> +    .bdrv_co_flush              = backup_top_co_flush,
> +
> +    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
> +
> +    .bdrv_refresh_filename      = backup_top_refresh_filename,
> +
> +    .bdrv_child_perm            = backup_top_child_perm,
> +
> +    .is_filter = true,
> +};
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp)
> +{
> +    Error *local_err = NULL;
> +    BDRVBackupTopState *state;
> +    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
> +                                                 NULL, BDRV_O_RDWR, errp);
> +
> +    if (!top) {
> +        return NULL;
> +    }
> +
> +    top->implicit = true;
> +    top->total_sectors = source->total_sectors;
> +    top->opaque = state = g_new0(BDRVBackupTopState, 1);
> +    state->copy_bitmap = copy_bitmap;
> +
> +    bdrv_ref(target);
> +    state->target = bdrv_attach_child(top, target, "target", &child_file, errp);
> +    if (!state->target) {
> +        bdrv_unref(target);
> +        bdrv_unref(top);
> +        return NULL;
> +    }
> +
> +    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
> +    bdrv_set_aio_context(target, bdrv_get_aio_context(source));
> +
> +    bdrv_drained_begin(source);
> +
> +    bdrv_ref(top);
> +    bdrv_append(top, source, &local_err);
> +
> +    if (local_err) {
> +        bdrv_unref(top);

This is done automatically by bdrv_append().

> +    }
> +
> +    bdrv_drained_end(source);
> +
> +    if (local_err) {
> +        bdrv_unref_child(top, state->target);
> +        bdrv_unref(top);
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    return top;
> +}
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    AioContext *aio_context = bdrv_get_aio_context(bs);
> +
> +    aio_context_acquire(aio_context);
> +
> +    bdrv_drained_begin(bs);
> +
> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
> +    bdrv_set_backing_hd(bs, NULL, &error_abort);

This is done automatically in bdrv_close(), and after bs has been
replaced by backing_bs(bs), I don't think new requests should come in,
so I don't think this needs to be done here.

> +
> +    bdrv_drained_end(bs);
> +
> +    if (s->target) {
> +        bdrv_unref_child(bs, s->target);
> +    }

And this should be done in a .bdrv_close() implementation, I think.

Max

> +    bdrv_unref(bs);
> +
> +    aio_context_release(aio_context);
> +}
> +
> +uint64_t bdrv_backup_top_progress(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    return s->bytes_copied;
> +}
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 7a81892a52..9072115c09 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>  
>  block-obj-y += crypto.o
>  
> +block-obj-y += backup-top.o
> +
>  common-obj-y += stream.o
>  
>  nfs.o-libs         := $(LIBNFS_LIBS)
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2019-01-16 16:02 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-29 12:20 [Qemu-devel] [PATCH v5 00/11] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 01/11] block/backup: simplify backup_incremental_init_copy_bitmap Vladimir Sementsov-Ogievskiy
2019-01-14 13:10   ` Max Reitz
2019-01-14 13:13     ` Max Reitz
2019-01-14 14:01     ` Vladimir Sementsov-Ogievskiy
2019-01-14 14:13       ` Max Reitz
2019-01-14 14:48         ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:05           ` Max Reitz
2019-01-23  8:20             ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:19               ` Max Reitz
2019-01-23 14:36               ` Eric Blake
2019-01-24 14:20                 ` Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 02/11] block/backup: move to copy_bitmap with granularity Vladimir Sementsov-Ogievskiy
2019-01-14 14:10   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 03/11] block: improve should_update_child Vladimir Sementsov-Ogievskiy
2019-01-14 14:32   ` Max Reitz
2019-01-14 16:13     ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:17       ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 04/11] iotests: handle -f argument correctly for qemu_io_silent Vladimir Sementsov-Ogievskiy
2019-01-14 14:36   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 05/11] iotests: allow resume_drive by node name Vladimir Sementsov-Ogievskiy
2019-01-14 14:46   ` Max Reitz
2019-01-14 16:06     ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:11       ` Max Reitz
2019-01-23 13:22         ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:31           ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:33             ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 06/11] iotests: prepare 055 to graph changes during backup job Vladimir Sementsov-Ogievskiy
2019-01-16 13:48   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 07/11] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
2019-01-16 16:02   ` Max Reitz [this message]
2019-01-17 12:13     ` Vladimir Sementsov-Ogievskiy
2019-01-18 12:05       ` Max Reitz
2019-01-23 13:47         ` Vladimir Sementsov-Ogievskiy
2019-04-13 16:08     ` Vladimir Sementsov-Ogievskiy
2019-04-13 16:08       ` Vladimir Sementsov-Ogievskiy
2019-04-13 17:03       ` Vladimir Sementsov-Ogievskiy
2019-04-13 17:03         ` Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 08/11] block/io: refactor wait_serialising_requests Vladimir Sementsov-Ogievskiy
2019-01-16 16:18   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 09/11] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
2019-01-16 16:36   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 10/11] block/backup: tiny refactor backup_job_create Vladimir Sementsov-Ogievskiy
2019-01-18 13:00   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
2019-01-18 14:56   ` Max Reitz
2019-01-28 11:29     ` Vladimir Sementsov-Ogievskiy
2019-01-28 15:59       ` Max Reitz
2019-01-28 16:44         ` Vladimir Sementsov-Ogievskiy
2019-01-28 16:53           ` Max Reitz
2019-01-28 17:14             ` Vladimir Sementsov-Ogievskiy
2019-01-28 17:40           ` Kevin Wolf
2019-01-28 19:00             ` Vladimir Sementsov-Ogievskiy
2019-01-23 15:26 ` [Qemu-devel] [PATCH v5 00/11] backup-top filter driver for backup no-reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5a822e18-6967-5059-bf21-6891aa701af4@redhat.com \
    --to=mreitz@redhat.com \
    --cc=den@openvz.org \
    --cc=eblake@redhat.com \
    --cc=fam@euphon.net \
    --cc=jcody@redhat.com \
    --cc=jsnow@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).