Re: [Qemu-devel] [PATCH] qemu-img: make convert async

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: Peter Lieven <pl@kamp.de>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org, stefanha@gmail.com,
	ct@flyingcircus.io, mreitz@redhat.com, famz@redhat.com,
	eblake@redhat.com
Subject: Re: [Qemu-devel] [PATCH] qemu-img: make convert async
Date: Fri, 24 Feb 2017 16:02:50 +0100	[thread overview]
Message-ID: <20170224150250.GB4429@noname.redhat.com> (raw)
In-Reply-To: <1487680191-13096-1-git-send-email-pl@kamp.de>

Am 21.02.2017 um 13:29 hat Peter Lieven geschrieben:
> the convert process is currently completely implemented with sync operations.
> That means it reads one buffer and then writes it. No parallelism and each sync
> request takes as long as it takes until it is completed.
> 
> This can be a big performance hit when the convert process reads and writes
> to devices which do not benefit from kernel readahead or pagecache.
> In our environment we heavily have the following two use cases when using
> qemu-img convert.
> 
> a) reading from NFS and writing to iSCSI for deploying templates
> b) reading from iSCSI and writing to NFS for backups
> 
> In both processes we use libiscsi and libnfs so we have no kernel cache.
> 
> This patch changes the convert process to work with parallel running coroutines
> which can significantly improve performance for network storage devices:
> 
> qemu-img (master)
>  nfs -> iscsi 22.8 secs
>  nfs -> ram   11.7 secs
>  ram -> iscsi 12.3 secs
> 
> qemu-img-async (8 coroutines, in-order write disabled)
>  nfs -> iscsi 11.0 secs
>  nfs -> ram   10.4 secs
>  ram -> iscsi  9.0 secs
> 
> This patches introduces 2 new cmdline parameters. The -m parameter to specify
> the number of coroutines running in parallel (defaults to 8). And the -W paremeter to
> allow qemu-img to write to the target out of order rather than sequential. This improves
> performance as the writes do not have to wait for each other to complete.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

> @@ -1563,9 +1581,13 @@ static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
>          blk = s->src[s->src_cur];
>          bs_sectors = s->src_sectors[s->src_cur];
>  
>          n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));

convert_co_read() uses global state here (s->src, s->src_cur,
s->src_cur_offset) while not running under a lock. Are you sure that
this is correct when some coroutines are operating on the first source
image, but others are already working on the second one?

The same variables are used in convert_iteration_sectors(), but I think
there it's okay because we're just starting a new request and are
holding s->lock.

> @@ -1651,12 +1685,122 @@ static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
>      return 0;
>  }
>  
> -static int convert_do_copy(ImgConvertState *s)
> +static void coroutine_fn convert_co_do_copy(void *opaque)
>  {
> +    ImgConvertState *s = opaque;
>      uint8_t *buf = NULL;
> -    int64_t sector_num, allocated_done;
> -    int ret;
> -    int n;
> +    int ret, i;
> +    int index = -1;
> +
> +    for (i = 0; i < s->num_coroutines; i++) {
> +        if (s->co[i] == qemu_coroutine_self()) {
> +            index = i;
> +            break;
> +        }
> +    }
> +    assert(index >= 0);
> +
> +    s->running_coroutines++;
> +    buf = blk_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> +
> +    while (1) {
> +        int n;
> +        int64_t sector_num;
> +        enum ImgConvertBlockStatus status;
> +
> +        qemu_co_mutex_lock(&s->lock);
> +        if (s->ret != -EINPROGRESS || s->sector_num >= s->total_sectors) {
> +            qemu_co_mutex_unlock(&s->lock);
> +            goto out;
> +        }
> +        n = convert_iteration_sectors(s, s->sector_num);
> +        if (n < 0) {
> +            qemu_co_mutex_unlock(&s->lock);
> +            s->ret = n;
> +            goto out;
> +        }
> +        /* safe current sector and allocation status to local variables */

s/safe/save/

> +        sector_num = s->sector_num;
> +        status = s->status;
> +        if (!s->min_sparse && s->status == BLK_ZERO) {
> +            n = MIN(n, s->buf_sectors);
> +        }
> +        /* increment global sector counter so that other coroutines can
> +         * already continue reading beyond this request */
> +        s->sector_num += n;
> +        qemu_co_mutex_unlock(&s->lock);

Here we unlock, so after this point we can only access globally valid
values from s, but not request-specific ones like s->status or
s->sector_num...

> +        if (status == BLK_DATA || (!s->min_sparse && s->status == BLK_ZERO)) {

...but s->status is used here...

> +            s->allocated_done += n;
> +            qemu_progress_print(100.0 * s->allocated_done /
> +                                        s->allocated_sectors, 0);
> +        }
> +
> +        if (status == BLK_DATA) {
> +            ret = convert_co_read(s, sector_num, n, buf);
> +            if (ret < 0) {
> +                error_report("error while reading sector %" PRId64
> +                             ": %s", sector_num, strerror(-ret));
> +                s->ret = ret;
> +                goto out;
> +            }
> +        } else if (!s->min_sparse && s->status == BLK_ZERO) {

...and here.

> +            status = BLK_DATA;
> +            memset(buf, 0x00, n * BDRV_SECTOR_SIZE);
> +        }
> +
> +        if (s->wr_in_order) {
> +            /* keep writes in order */
> +            while (s->wr_offs != sector_num) {
> +                if (s->ret != -EINPROGRESS) {
> +                    goto out;
> +                }
> +                s->wait_sector_num[index] = sector_num;
> +                qemu_coroutine_yield();
> +            }
> +            s->wait_sector_num[index] = -1;
> +        }
> +
> +        ret = convert_co_write(s, sector_num, n, buf, status);
> +        if (ret < 0) {
> +            error_report("error while writing sector %" PRId64
> +                         ": %s", sector_num, strerror(-ret));
> +            s->ret = ret;
> +            goto out;
> +        }
> +
> +        if (s->wr_in_order) {
> +            /* reenter the coroutine that might have waited
> +             * for this write to complete */
> +            s->wr_offs = sector_num + n;
> +            for (i = 0; i < s->num_coroutines; i++) {
> +                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
> +                    /*
> +                     * A -> B -> A cannot occur because A has
> +                     * s->wait_sector_num[i] == -1 during A -> B. Therefore
> +                     * B will never enter A during this time window.
> +                     */
> +                    qemu_coroutine_enter(s->co[i]);
> +                    break;
> +                }
> +            }
> +        }
> +    }
> +
> +out:
> +    qemu_vfree(buf);
> +    s->co[index] = NULL;
> +    s->running_coroutines--;
> +    if (!s->running_coroutines && s->ret == -EINPROGRESS) {
> +        /* the convert job finished successfully */
> +        s->ret = 0;
> +    }
> +}
> diff --git a/qemu-img.texi b/qemu-img.texi
> index 174aae3..6715ff3 100644
> --- a/qemu-img.texi
> +++ b/qemu-img.texi
> @@ -137,6 +137,12 @@ Parameters to convert subcommand:
>  
>  @item -n
>  Skip the creation of the target volume
> +@item -m
> +Number of parallel coroutines for the convert process
> +@item -W
> +Allow to write out of order to the destination. This is option improves performance,
> +but is only recommened for preallocated devices like host devices or other
> +raw block devices.
>  @end table
>  
>  Parameters to dd subcommand:
> @@ -296,7 +302,7 @@ Error on reading data
>  
>  @end table
>  
> -@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
> +@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-m @var{num_coroutines}] [-W] {[-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}

The extra { before -S causes 'make qemu-doc.html' to fail (as reported
by patchew).

Kevin

next prev parent reply	other threads:[~2017-02-24 15:03 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-21 12:29 [Qemu-devel] [PATCH] qemu-img: make convert async Peter Lieven
2017-02-21 14:10 ` no-reply
2017-02-22 15:31 ` Stefan Hajnoczi
2017-02-23 10:03   ` Peter Lieven
2017-02-24 15:02 ` Kevin Wolf [this message]
2017-02-24 20:09   ` Peter Lieven

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170224150250.GB4429@noname.redhat.com \
    --to=kwolf@redhat.com \
    --cc=ct@flyingcircus.io \
    --cc=eblake@redhat.com \
    --cc=famz@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).