From: Stefano Garzarella <sgarzare@redhat.com>
To: dillaman@redhat.com
Cc: qemu-devel <qemu-devel@nongnu.org>, Kevin Wolf <kwolf@redhat.com>,
Josh Durgin <jdurgin@redhat.com>,
qemu-block <qemu-block@nongnu.org>,
Markus Armbruster <armbru@redhat.com>,
Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH] block/rbd: add preallocation support
Date: Mon, 29 Apr 2019 16:08:51 +0200 [thread overview]
Message-ID: <20190429140851.pgzieu4jr3jgdd4e@steredhat> (raw)
In-Reply-To: <CA+aFP1Cmc+3TcYJr73jYRmtCWNTAWBcjEtn+Afe7DKZqUYp7=g@mail.gmail.com>
On Mon, Apr 29, 2019 at 09:00:26AM -0400, Jason Dillaman wrote:
> On Mon, Apr 29, 2019 at 8:47 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote:
> > > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > >
> > > > This patch adds the support of preallocation (off/full) for the RBD
> > > > block driver.
> > > > If available, we use rbd_writesame() to quickly fill the image when
> > > > full preallocation is required.
> > > >
> > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > ---
> > > > block/rbd.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> > > > qapi/block-core.json | 4 +-
> > > > 2 files changed, 136 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > index 0c549c9935..29dd1bb040 100644
> > > > --- a/block/rbd.c
> > > > +++ b/block/rbd.c
> > > > @@ -13,6 +13,7 @@
> > > >
> > > > #include "qemu/osdep.h"
> > > >
> > > > +#include "qemu/units.h"
> > > > #include <rbd/librbd.h>
> > > > #include "qapi/error.h"
> > > > #include "qemu/error-report.h"
> > > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
> > > > }
> > > > }
> > > >
> > > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset,
> > > > + PreallocMode prealloc, Error **errp)
> > > > +{
> > > > + uint64_t current_length;
> > > > + char *buf = NULL;
> > > > + int ret;
> > > > +
> > > > + ret = rbd_get_size(image, ¤t_length);
> > > > + if (ret < 0) {
> > > > + error_setg_errno(errp, -ret, "Failed to get file length");
> > > > + goto out;
> > > > + }
> > > > +
> > > > + if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > > + error_setg(errp, "Cannot use preallocation for shrinking files");
> > > > + ret = -ENOTSUP;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + switch (prealloc) {
> > > > + case PREALLOC_MODE_FULL: {
> > > > + uint64_t current_offset = current_length;
> > > > + int buf_size = 64 * KiB;
> > >
> > > This should probably be 512B or 4KiB (which is the smallest striped
> > > unit size). Not only will this avoid sending unnecessary zeroes to the
> > > backing cluster, writesame silently turns into a standard write if
> > > your buffer isn't properly aligned with the min(object size, stripe
> > > unit size).
> > >
> >
> > Okay, I'll change it on v2.
> > Should we query about the "stripe_unit" size or we simply use the
> > smallest allowed?
>
> Technically we don't prevent a user from choosing terrible stripe unit
> sizes (e.g. 1 byte), so you are probably safe to just use 4KiB.
>
> > > > + ssize_t bytes;
> > > > +
> > > > + buf = g_malloc(buf_size);
> > > > + /*
> > > > + * Some versions of rbd_writesame() discards writes of buffers with
> > > > + * all zeroes. In order to avoid this behaviour, we set the first byte
> > > > + * to one.
> > > > + */
> > > > + buf[0] = 1;
> > >
> > > You could also use "rados_conf_set(cluster,
> > > "rbd_discard_on_zeroed_write_same", "false").
> > >
> >
> > I tried it, but it is not supported on all versions. (eg. I have Ceph
> > v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is
> > available)
> >
> > Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and
> > "buf[0] = 1"
>
> Probably not worth the effort if it's not supported across all releases.
>
> > > > + ret = rbd_resize(image, offset);
> > > > + if (ret < 0) {
> > > > + error_setg_errno(errp, -ret, "Failed to resize file");
> > > > + goto out;
> > > > + }
> > > > +
> > > > +#ifdef LIBRBD_SUPPORTS_WRITESAME
> > > > + while (offset - current_offset > buf_size) {
> > > > + /*
> > > > + * rbd_writesame() supports only request where the size of the
> > > > + * operation is multiple of buffer size and it must be less or
> > > > + * equal to INT_MAX.
> > > > + */
> > > > + bytes = MIN(offset - current_offset, INT_MAX);
> > > > + bytes -= bytes % buf_size;
> > >
> > > Using the default object size of 4MiB, this write size would result in
> > > up to 512 concurrent ops to the backing cluster. Perhaps the size
> > > should be bounded such that only a dozen or so concurrent requests are
> > > issued per write, always rounded next largest object / stripe period
> > > size. librbd and the rbd CLI usually try to bound themselves to the
> > > value in the "rbd_concurrent_management_ops" configuration setting
> > > (currently defaults to 10).
> > >
> >
> > Do you suggest to use "rbd_concurrent_management_ops" to limit
> > concurrent requests or use a new QEMU parameters for the RBD driver?
>
> I think it would be nicer to just query the
> "rbd_concurrent_management_ops" limit to derive your writesame size
> since the Ceph cluster admin can globally set that option to match the
> available parallelism of the cluster.
>
Thanks for the details, I'll send a v2 following yuor comments!
Stefano
WARNING: multiple messages have this Message-ID (diff)
From: Stefano Garzarella <sgarzare@redhat.com>
To: dillaman@redhat.com
Cc: Kevin Wolf <kwolf@redhat.com>, Josh Durgin <jdurgin@redhat.com>,
qemu-block <qemu-block@nongnu.org>,
Markus Armbruster <armbru@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH] block/rbd: add preallocation support
Date: Mon, 29 Apr 2019 16:08:51 +0200 [thread overview]
Message-ID: <20190429140851.pgzieu4jr3jgdd4e@steredhat> (raw)
Message-ID: <20190429140851.vVZe1GazIHxHezNoSgBTRSf42ud6pxwS2mERSVhOXTw@z> (raw)
In-Reply-To: <CA+aFP1Cmc+3TcYJr73jYRmtCWNTAWBcjEtn+Afe7DKZqUYp7=g@mail.gmail.com>
On Mon, Apr 29, 2019 at 09:00:26AM -0400, Jason Dillaman wrote:
> On Mon, Apr 29, 2019 at 8:47 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote:
> > > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > >
> > > > This patch adds the support of preallocation (off/full) for the RBD
> > > > block driver.
> > > > If available, we use rbd_writesame() to quickly fill the image when
> > > > full preallocation is required.
> > > >
> > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > ---
> > > > block/rbd.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> > > > qapi/block-core.json | 4 +-
> > > > 2 files changed, 136 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > index 0c549c9935..29dd1bb040 100644
> > > > --- a/block/rbd.c
> > > > +++ b/block/rbd.c
> > > > @@ -13,6 +13,7 @@
> > > >
> > > > #include "qemu/osdep.h"
> > > >
> > > > +#include "qemu/units.h"
> > > > #include <rbd/librbd.h>
> > > > #include "qapi/error.h"
> > > > #include "qemu/error-report.h"
> > > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
> > > > }
> > > > }
> > > >
> > > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset,
> > > > + PreallocMode prealloc, Error **errp)
> > > > +{
> > > > + uint64_t current_length;
> > > > + char *buf = NULL;
> > > > + int ret;
> > > > +
> > > > + ret = rbd_get_size(image, ¤t_length);
> > > > + if (ret < 0) {
> > > > + error_setg_errno(errp, -ret, "Failed to get file length");
> > > > + goto out;
> > > > + }
> > > > +
> > > > + if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > > + error_setg(errp, "Cannot use preallocation for shrinking files");
> > > > + ret = -ENOTSUP;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + switch (prealloc) {
> > > > + case PREALLOC_MODE_FULL: {
> > > > + uint64_t current_offset = current_length;
> > > > + int buf_size = 64 * KiB;
> > >
> > > This should probably be 512B or 4KiB (which is the smallest striped
> > > unit size). Not only will this avoid sending unnecessary zeroes to the
> > > backing cluster, writesame silently turns into a standard write if
> > > your buffer isn't properly aligned with the min(object size, stripe
> > > unit size).
> > >
> >
> > Okay, I'll change it on v2.
> > Should we query about the "stripe_unit" size or we simply use the
> > smallest allowed?
>
> Technically we don't prevent a user from choosing terrible stripe unit
> sizes (e.g. 1 byte), so you are probably safe to just use 4KiB.
>
> > > > + ssize_t bytes;
> > > > +
> > > > + buf = g_malloc(buf_size);
> > > > + /*
> > > > + * Some versions of rbd_writesame() discards writes of buffers with
> > > > + * all zeroes. In order to avoid this behaviour, we set the first byte
> > > > + * to one.
> > > > + */
> > > > + buf[0] = 1;
> > >
> > > You could also use "rados_conf_set(cluster,
> > > "rbd_discard_on_zeroed_write_same", "false").
> > >
> >
> > I tried it, but it is not supported on all versions. (eg. I have Ceph
> > v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is
> > available)
> >
> > Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and
> > "buf[0] = 1"
>
> Probably not worth the effort if it's not supported across all releases.
>
> > > > + ret = rbd_resize(image, offset);
> > > > + if (ret < 0) {
> > > > + error_setg_errno(errp, -ret, "Failed to resize file");
> > > > + goto out;
> > > > + }
> > > > +
> > > > +#ifdef LIBRBD_SUPPORTS_WRITESAME
> > > > + while (offset - current_offset > buf_size) {
> > > > + /*
> > > > + * rbd_writesame() supports only request where the size of the
> > > > + * operation is multiple of buffer size and it must be less or
> > > > + * equal to INT_MAX.
> > > > + */
> > > > + bytes = MIN(offset - current_offset, INT_MAX);
> > > > + bytes -= bytes % buf_size;
> > >
> > > Using the default object size of 4MiB, this write size would result in
> > > up to 512 concurrent ops to the backing cluster. Perhaps the size
> > > should be bounded such that only a dozen or so concurrent requests are
> > > issued per write, always rounded next largest object / stripe period
> > > size. librbd and the rbd CLI usually try to bound themselves to the
> > > value in the "rbd_concurrent_management_ops" configuration setting
> > > (currently defaults to 10).
> > >
> >
> > Do you suggest to use "rbd_concurrent_management_ops" to limit
> > concurrent requests or use a new QEMU parameters for the RBD driver?
>
> I think it would be nicer to just query the
> "rbd_concurrent_management_ops" limit to derive your writesame size
> since the Ceph cluster admin can globally set that option to match the
> available parallelism of the cluster.
>
Thanks for the details, I'll send a v2 following yuor comments!
Stefano
next prev parent reply other threads:[~2019-04-29 14:08 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-27 11:36 [Qemu-devel] [PATCH] block/rbd: add preallocation support Stefano Garzarella
2019-04-27 11:36 ` Stefano Garzarella
2019-04-27 12:43 ` [Qemu-devel] [Qemu-block] " Jason Dillaman
2019-04-27 12:43 ` Jason Dillaman
2019-04-29 12:47 ` Stefano Garzarella
2019-04-29 12:47 ` Stefano Garzarella
2019-04-29 13:00 ` Jason Dillaman
2019-04-29 13:00 ` Jason Dillaman
2019-04-29 14:08 ` Stefano Garzarella [this message]
2019-04-29 14:08 ` Stefano Garzarella
2019-05-07 6:34 ` [Qemu-devel] Use of PreallocMode in block drivers (was: [PATCH] block/rbd: add preallocation support) Markus Armbruster
2019-05-07 8:36 ` Stefano Garzarella
2019-05-08 11:44 ` [Qemu-devel] Use of PreallocMode in block drivers Markus Armbruster
2019-05-09 8:26 ` Stefano Garzarella
2019-05-09 12:07 ` Markus Armbruster
2019-05-10 8:36 ` Stefano Garzarella
2019-05-09 13:29 ` Peter Krempa
2019-05-10 8:38 ` Stefano Garzarella
2019-05-22 8:57 ` Stefano Garzarella
2019-05-22 16:25 ` Markus Armbruster
2019-05-23 13:39 ` Stefano Garzarella
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190429140851.pgzieu4jr3jgdd4e@steredhat \
--to=sgarzare@redhat.com \
--cc=armbru@redhat.com \
--cc=dillaman@redhat.com \
--cc=jdurgin@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).