From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Max Reitz <mreitz@redhat.com>, qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, armbru@redhat.com, eblake@redhat.com,
fam@euphon.net, stefanha@redhat.com, kwolf@redhat.com,
den@openvz.org
Subject: Re: [PATCH v6 08/15] block: introduce preallocate filter
Date: Thu, 24 Sep 2020 20:36:32 +0300 [thread overview]
Message-ID: <fd9c5668-5f4e-a56b-b14d-a169ff97411f@virtuozzo.com> (raw)
In-Reply-To: <73b1b0eb-5ca3-a436-5783-695d62fe337e@redhat.com>
24.09.2020 19:50, Max Reitz wrote:
> On 18.09.20 20:19, Vladimir Sementsov-Ogievskiy wrote:
>> It's intended to be inserted between format and protocol nodes to
>> preallocate additional space (expanding protocol file) on writes
>> crossing EOF. It improves performance for file-systems with slow
>> allocation.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>> docs/system/qemu-block-drivers.rst.inc | 26 ++
>> qapi/block-core.json | 20 +-
>> block/preallocate.c | 556 +++++++++++++++++++++++++
>> block/meson.build | 1 +
>> 4 files changed, 602 insertions(+), 1 deletion(-)
>> create mode 100644 block/preallocate.c
>
> Looks good to me in general.
>
> [...]
>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> new file mode 100644
>> index 0000000000..6510ad0149
>> --- /dev/null
>> +++ b/block/preallocate.c
>
> [...]
>
>> +/*
>> + * Handle reopen.
>> + *
>> + * We must implement reopen handlers, otherwise reopen just don't work. Handle
>> + * new options and don't care about preallocation state, as it is handled in
>> + * set/check permission handlers.
>> + */
>> +
>> +static int preallocate_reopen_prepare(BDRVReopenState *reopen_state,
>> + BlockReopenQueue *queue, Error **errp)
>> +{
>> + PreallocateOpts *opts = g_new0(PreallocateOpts, 1);
>> +
>> + if (!preallocate_absorb_opts(opts, reopen_state->options,
>> + reopen_state->bs->file->bs, errp)) {
>
> I tried to find out whether this refers to the old file child, or the
> post-reopen one.
Note, that it's needed only to check request_alignment. Probably it's better to pass request_alignment to preallocate_absorb_opts, not the whole child.
> As far as I could find out, there is no generic
> implementation for changing the file child as part of x-blockdev-reopen:
>
> {"error": {"class": "GenericError", "desc": "Cannot change the option
> 'file'"}}
>
> Now that’s a shame. That means you can’t reasonably integrate a
> preallocate filter into an existing node graph unless the format driver
> checks for the respective child option and issues a replace_node on
> commit or something, right? I suppose any driver who’d want to
> implement child replacement would need to attach the new node in
> prepare() as some pseudo-child, and then drop the old one and rename the
> new one in commit(). I don’t think any driver does that yet, so I
> suppose no format driver allows replacement of children yet (except for
> quorum...).
>
> Does anyone know what the status on that is? Are there any plans for
> implementing child replacement in reopen, or did I just miss something?
>
> (It looks like the backing child can be replaced, but that’s probably
> not a child where the preallocate filter would be placed on top...).
Hm. I didn't care about it, because main use case is to insert the filter at start, specifying it in -drive or in -blockdev options.
Probably, we need a separate API which will insert/remove filters like it is done in block jobs code, not reopening the block device.
>
>> + g_free(opts);
>> + return -EINVAL;
>> + }
>> +
>> + reopen_state->opaque = opts;
>> +
>> + return 0;
>> +}
>
> [...]
>
>> +/*
>> + * Call on each write. Returns true if @want_merge_zero is true and the region
>> + * [offset, offset + bytes) is zeroed (as a result of this call or earlier
>> + * preallocation).
>> + *
>> + * want_merge_zero is used to merge write-zero request with preallocation in
>> + * one bdrv_co_pwrite_zeroes() call.
>> + */
>> +static bool coroutine_fn handle_write(BlockDriverState *bs, int64_t offset,
>> + int64_t bytes, bool want_merge_zero)
>> +{
>> + BDRVPreallocateState *s = bs->opaque;
>> + int64_t end = offset + bytes;
>> + int64_t prealloc_start, prealloc_end;
>> + int ret;
>> +
>> + if (!has_prealloc_perms(bs)) {
>
> Took me a bit to figure out, because it takes a trip to
> preallocate_child_perm() to figure out exactly when we’re going to have
> the necessary permissions for this to pass.
>
> Then it turns out that this is going to be the case exactly when the
> parents collectively have the same permissions (WRITE+RESIZE) on the
> preallocate node.
>
> Then I had to wonder whether it’s appropriate not to preallocate if
> WRITE is taken, but RESIZE isn’t. But that seems entirely correct, yes.
> If noone is going to grow the file, then there is no need for
> preallocation. (Vice versa, if noone is going to write, but only
> resize, then there is no need for preallocation either.)
>
> So this seems correct, yes.
>
> (It could be argued that if one parent has WRITE, and another has RESIZE
> (but neither has both), then we probably don’t need preallocation
> either. But in such an arcane case (which is impossible to figure out
> in .bdrv_child_perm() anyway), we might as well just do preallocation.
> Won’t hurt.)
>
>> + /* We don't have state neither should try to recover it */
>> + return false;
>> + }
>> +
>> + if (s->data_end < 0) {
>> + s->data_end = bdrv_getlength(bs->file->bs);
>> + if (s->data_end < 0) {
>> + return false;
>> + }
>> +
>> + if (s->file_end < 0) {
>> + s->file_end = s->data_end;
>> + }
>> + }
>> +
>> + if (end <= s->data_end) {
>> + return false;
>> + }
>> +
>> + /* We have valid s->data_end, and request writes beyond it. */
>> +
>> + s->data_end = end;
>> + if (s->zero_start < 0 || !want_merge_zero) {
>> + s->zero_start = end;
>
> Skipping this on want_merge_zero == true means that zero writes can be
> cached; if you repeatedly perform zero writes into the preallocated
> area, then none of those will actually be executed. I legitimately
> don’t know whether that’s OK.
>
> I suppose until someone tells me it isn’t OK, I’ll believe it is.
Skipping zero-writes to preallocated area is significant for performance.
If I remember correctly, handle_alloc_space() in qcow2 will otherwise do extra write-zeroes, which reduces performance.
So, we assume that zero-write after EOF is a kind of preallocation. And avoid preallocating same region twice.
Using zero_start instead of data_end is probably less significant, I'm not sure.
>
>> + }
>> +
>> + if (s->file_end < 0) {
>> + s->file_end = bdrv_getlength(bs->file->bs);
>> + if (s->file_end < 0) {
>> + return false;
>> + }
>> + }
>> +
>> + /* Now s->data_end, s->zero_start and s->file_end are valid. */
>> +
>> + if (end <= s->file_end) {
>> + /* No preallocation needed. */
>> + return want_merge_zero && offset >= s->zero_start;
>> + }
>> +
>> + /* Now we want new preallocation, as request writes beyond s->data_end. */
>
> s/data_end/file_end/>
>> +
>> + prealloc_start = want_merge_zero ? MIN(offset, s->file_end) : s->file_end;
>
> I suppose you intentionally use s->file_end here instead of @end, even
> if offset <= s->file_end. I just mention it because I wonder whether
> it’s really better to effectively write twice to the same area in such
> cases (once zeroes for preallocation, then immediately the data) instead
> of only writing the data and then preallocating past it.
>
> (Though if it were the same code just with @end instead of s->file_end
> for offset <= s->file_end, then the order would be to preallocate past
> @end, and then to write the data. Which might be suboptimal in terms of
> how the blocks are then ordered in the filesystem.)
Yes, I think it's better to preallocate from file_end, in this way we do our best for good file blocks location in filesystem. And writing twice shouldn't really matter with big enough preallocation size (a lot larger than usual guest write operation).
>
>> + prealloc_end = QEMU_ALIGN_UP(offset + bytes + s->opts.prealloc_size,
>
> s/offset + bytes/end/?
yes
>
>> + s->opts.prealloc_align);
>> + s->file_end = end;
>
> Why is this set here, when it’s always overwritten after
> bdrv_co_pwrite_zeroes() anyway?
>
> (It also seems a bit wrong, because at this point we don’t know yet
> whether the data write is going to succeed, so we don’t know for sure
> whether the file end is really going to be @end without the preallocation.)
Agree, that's strange and seems wrong. Will drop.
>
>> +
>> + ret = bdrv_co_pwrite_zeroes(
>> + bs->file, prealloc_start, prealloc_end - prealloc_start,
>> + BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> + if (ret < 0) {
>> + s->file_end = ret;
>> + return false;
>> + }
>> +
>> + s->file_end = prealloc_end;
>> + return want_merge_zero;
>> +}
>> +
>> +static int coroutine_fn preallocate_co_pwrite_zeroes(BlockDriverState *bs,
>> + int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> + bool want_merge_zero =
>> + (flags & (BDRV_REQ_ZERO_WRITE | BDRV_REQ_NO_FALLBACK)) == flags;
>
> Isn’t this the same as !(flags & ~(ZERO_WRITE | NO_FALLBACK))? (Maybe
> only I would find that simpler to understand, though.)
Actually, I think yours is more usual notation and it uses "flags" once. Will change.
>
>> + if (handle_write(bs, offset, bytes, want_merge_zero)) {
>> + return 0;
>> + }
>> +
>> + return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
>> +}
>
> [...]
>
>> +static int coroutine_fn
>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>> + bool exact, PreallocMode prealloc,
>> + BdrvRequestFlags flags, Error **errp)
>> +{
>> + ERRP_GUARD();
>> + BDRVPreallocateState *s = bs->opaque;
>> + int ret;
>> +
>> + if (s->data_end >= 0 && offset > s->data_end) {
>> + if (s->file_end < 0) {
>> + s->file_end = bdrv_getlength(bs->file->bs);
>> + if (s->file_end < 0) {
>> + error_setg(errp, "failed to get file length");
>> + return s->file_end;
>> + }
>> + }
>> +
>> + if (prealloc == PREALLOC_MODE_FALLOC) {
>> + /*
>> + * If offset <= s->file_end, the task is already done, just
>> + * update s->file_end, to move part of "filter preallocation"
>
> s/file_end/data_end/
>
>> + * to "preallocation requested by user".
>> + * Otherwise just proceed to preallocate missing part.
>> + */
>> + if (offset <= s->file_end) {
>> + s->data_end = offset;
>> + return 0;
>> + }
>
> [...]
>
>> +static int preallocate_check_perm(BlockDriverState *bs,
>> + uint64_t perm, uint64_t shared, Error **errp)
>> +{
>> + BDRVPreallocateState *s = bs->opaque;
>> +
>> + if (s->data_end >= 0 && !can_write_resize(perm)) {
>> + /*
>> + * Loose permissions.
>
> *Lose
>
> (I assume)
Oops)
>
>> + * We should truncate in check_perm, as in set_perm bs->file->perm will
>> + * be already changed, and we should not violate it.
>> + */
>> + if (s->file_end < 0) {
>> + s->file_end = bdrv_getlength(bs->file->bs);
>> + if (s->file_end < 0) {
>> + error_setg(errp, "Failed to get file length");
>> + return s->file_end;
>> + }
>> + }
>> +
>> + if (s->data_end < s->file_end) {
>> + int ret = bdrv_truncate(bs->file, s->data_end, true,
>> + PREALLOC_MODE_OFF, 0, NULL);
>> + if (ret < 0) {
>> + error_setg(errp, "Failed to drop preallocation");
>> + s->file_end = ret;
>> + return ret;
>> + }
>> + }
>> + /*
>> + * We will drop our permissions, as well as allow shared
>> + * permissions, anyone will be able to change the child, so mark
>> + * all states invalid. We'll regain control if get good permissions
>> + * back.
>> + */
>> + s->data_end = s->file_end = s->zero_start = -EINVAL;
>
> Shouldn’t we clear these variables whenever !can_write_resize(perm), not
> just if also s->data_end >= 0?
Yes, will fix.
>
>> + }
>> +
>> + return 0;
>> +}
>
> Max
>
Thanks!
--
Best regards,
Vladimir
next prev parent reply other threads:[~2020-09-24 17:37 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-18 18:19 [PATCH v6 00/15] preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 01/15] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
2020-09-28 15:34 ` Alberto Garcia
2020-09-18 18:19 ` [PATCH v6 02/15] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 03/15] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 04/15] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 05/15] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 06/15] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 07/15] block: bdrv_check_perm(): process children anyway Vladimir Sementsov-Ogievskiy
2020-09-24 14:25 ` Max Reitz
2020-09-24 14:55 ` Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 08/15] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-24 16:50 ` Max Reitz
2020-09-24 17:36 ` Vladimir Sementsov-Ogievskiy [this message]
2020-09-18 18:19 ` [PATCH v6 09/15] qemu-io: add preallocate mode parameter for truncate command Vladimir Sementsov-Ogievskiy
2020-09-24 17:08 ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 10/15] iotests: qemu_io_silent: support --image-opts Vladimir Sementsov-Ogievskiy
2020-09-25 7:10 ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 11/15] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
2020-09-25 8:26 ` Max Reitz
2020-09-25 8:49 ` Vladimir Sementsov-Ogievskiy
2020-09-25 9:11 ` Max Reitz
2020-09-25 15:11 ` Vladimir Sementsov-Ogievskiy
2020-09-25 15:32 ` Vladimir Sementsov-Ogievskiy
2020-10-01 7:40 ` Max Reitz
2020-10-01 8:41 ` Thomas Huth
2020-09-18 18:19 ` [PATCH v6 12/15] scripts/simplebench: support iops Vladimir Sementsov-Ogievskiy
2020-09-25 8:54 ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 13/15] scripts/simplebench: improve view of ascii table Vladimir Sementsov-Ogievskiy
2020-09-25 9:31 ` Max Reitz
2020-09-25 16:58 ` Vladimir Sementsov-Ogievskiy
2020-09-18 18:19 ` [PATCH v6 14/15] scripts/simplebench: improve ascii table: add difference line Vladimir Sementsov-Ogievskiy
2020-09-25 10:24 ` Max Reitz
2020-09-25 17:13 ` Vladimir Sementsov-Ogievskiy
2020-10-01 8:22 ` Max Reitz
2020-09-18 18:19 ` [PATCH v6 15/15] scripts/simplebench: add bench_prealloc.py Vladimir Sementsov-Ogievskiy
2020-09-25 11:25 ` Max Reitz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fd9c5668-5f4e-a56b-b14d-a169ff97411f@virtuozzo.com \
--to=vsementsov@virtuozzo.com \
--cc=armbru@redhat.com \
--cc=den@openvz.org \
--cc=eblake@redhat.com \
--cc=fam@euphon.net \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).