From: "Denis V. Lunev" <den@virtuozzo.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org, stefanha@redhat.com, eblake@redhat.com,
eesposit@redhat.com, pbonzini@redhat.com, qemu-devel@nongnu.org,
"Denis V. Lunev" <den@openvz.org>
Subject: Re: [PATCH v2 03/21] preallocate: Don't poll during permission updates
Date: Mon, 9 Oct 2023 11:28:13 +0200 [thread overview]
Message-ID: <94e17e07-d791-4cfe-970f-cdb9e7ae378b@virtuozzo.com> (raw)
In-Reply-To: <f210d1fd-5a4e-43a5-b76b-553e74708a5f@yandex-team.ru>
On 10/6/23 20:10, Vladimir Sementsov-Ogievskiy wrote:
> On 06.10.23 11:56, Kevin Wolf wrote:
>> Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> On 11.09.23 12:46, Kevin Wolf wrote:
>>>> When the permission related BlockDriver callbacks are called, we
>>>> are in
>>>> the middle of an operation traversing the block graph. Polling in
>>>> such a
>>>> place is a very bad idea because the graph could change in unexpected
>>>> ways. In the future, callers will also hold the graph lock, which is
>>>> likely to turn polling into a deadlock.
>>>>
>>>> So we need to get rid of calls to functions like bdrv_getlength() or
>>>> bdrv_truncate() there as these functions poll internally. They are
>>>> currently used so that when no parent has write/resize permissions on
>>>> the image any more, the preallocate filter drops the extra
>>>> preallocated
>>>> area in the image file and gives up write/resize permissions itself.
>>>>
>>>> In order to achieve this without polling in .bdrv_check_perm, don't
>>>> immediately truncate the image, but only schedule a BH to do so. The
>>>> filter keeps the write/resize permissions a bit longer now until
>>>> the BH
>>>> has executed.
>>>>
>>>> There is one case in which delaying doesn't work: Reopening the image
>>>> read-only. In this case, bs->file will likely be reopened read-only,
>>>> too, so keeping write permissions a bit longer on it doesn't work. But
>>>> we can already cover this case in preallocate_reopen_prepare() and not
>>>> rely on the permission updates for it.
>>>
>>> Hmm, now I found one more "future" case.
>>>
>>> I now try to rebase my "[PATCH v7 0/7] blockdev-replace"
>>> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/
>>>
>>>
>>> And it breaks after this commit.
>>>
>>> By accident, blockdev-replace series uses exactly "preallocate" filter
>>> to test insertion/removing of filters. And removing is broken now.
>>>
>>> Removing is done as follows:
>>>
>>> 1. We have filter inserted: disk0 -file-> filter -file-> file0
>>>
>>> 2. blockdev-replace, replaces file child of disk0, so we should get
>>> the picture*: disk0 -file-> file0 <-file- filter
>>>
>>> 3. blockdev-del filter
>>>
>>>
>>> But step [2] fails, as now preallocate filter doesn't drop permissions
>>> during the operation (postponing this for a while) and the picture* is
>>> impossible. Permission check fails.
>>>
>>> Hmmm... Any idea how blockdev-replace and preallocate filter should
>>> work :) ? Maybe, doing truncation in .drain_begin() will help? Will
>>> try
>>
>> Hm... What preallocate tries to do is really tricky...
>>
>> Of course, the error is correct, this is an invalid configuration if
>> preallocate can still resize the image. So it would have to truncate the
>> file earlier, but the first time that preallocate knows of the change is
>> already too late to run requests.
>>
>> Truncating on drain_begin feels more like a hack, but as long as it does
>> the job... Of course, this will have the preallocation truncated away on
>> events that have nothing to do with removing the filter. It's not
>> necessarily a disaster because preallocation is only an optimisation,
>> but it doesn't feel great.
>
> Hmm, yes, that's not good.
>
>>
>> Maybe let's take a step back: Which scenario is the preallocate driver
>> meant for and why do we even need to truncate the image file after
>> removing the filter? I suppose the filter doesn't make sense with raw
>> images because these are fixed size anyway, and pretty much any other
>> image format should be able to tolerate a permanently rounded up file
>> size. As long as you don't write to the preallocated area, it shouldn't
>> take space either on any sane filesystem.
>>
>> Hmm, actually both VHD and VMDK can have footers, better avoid it with
>> those... But if truncating the image file on close is critical, what do
>> you do on crashes? Maybe preallocate should just not be considered
>> compatible with these formats?
>>
>
> Originally preallocate filter was made to be used with qcow2, on some
> proprietary storage, where:
>
> 1. Allocating of big chunk works a lot fater than allocating several
> smaller chunks
> 2. Holes are not free and/or file length is not free, so we really
> want to truncate the file back on close
>
> Den, correct me if I'm wrong.
1. Absolutely correct. This is true when the file attributes
are stored in a centralized place aka metadata storage
and requests to it does not scale well.
2. This is at my opinion has different meaning. We have
tried to make local storage behavior and distributed
storage behavior to be the same when VM is off, i.e.
the file should be in the same state (no free blocks
at the end of the file).
>
> Good thing is that in this scenario we don't need to remove the filter
> in runtime, so there is no problem.
>
Yes, this filter is not dynamic in that respect. It is either
here or not here.
>
> Now I think that the generic solution is just add a new handler
> .bdrv_pre_replace, so blockdev-replace may work as follows:
>
> drain_begin
>
> call .bdrv_pre_replace for all affected nodes
>
> do the replace
>
> drain_end
>
> And prellocate filter would do truncation in this .bdrv_pre_replace
> handler and set some flag, that we have nothing to trunctate (the flag
> is automatically cleared on .drained_end handler). Then during
> permission update if we see that nothing-to-truncate flag, we can drop
> permissions immediately.
>
> But this difficulty may be postponed, and I can just document that
> preallocate filter doesn't support removing in runtime (and modify the
> test to use another filter, or just not to remove preallocate filter).
>
Den
next prev parent reply other threads:[~2023-10-09 9:28 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-11 9:45 [PATCH v2 00/21] Graph locking part 4 (node management) Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 01/21] block: Remove unused BlockReopenQueueEntry.perms_checked Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 02/21] preallocate: Factor out preallocate_truncate_to_real_size() Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 03/21] preallocate: Don't poll during permission updates Kevin Wolf
2023-10-05 19:55 ` Vladimir Sementsov-Ogievskiy
2023-10-06 8:56 ` Kevin Wolf
2023-10-06 18:10 ` Vladimir Sementsov-Ogievskiy
2023-10-09 9:28 ` Denis V. Lunev [this message]
2023-09-11 9:46 ` [PATCH v2 04/21] block: Take AioContext lock for bdrv_append() more consistently Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 05/21] block: Introduce bdrv_schedule_unref() Kevin Wolf
2023-09-12 16:49 ` Stefan Hajnoczi
2023-09-11 9:46 ` [PATCH v2 06/21] block-coroutine-wrapper: Add no_co_wrapper_bdrv_wrlock functions Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 07/21] block-coroutine-wrapper: Allow arbitrary parameter names Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 08/21] block: Mark bdrv_replace_child_noperm() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 09/21] block: Mark bdrv_replace_child_tran() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 10/21] block: Mark bdrv_attach_child_common() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 11/21] block: Call transaction callbacks with lock held Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 12/21] block: Mark bdrv_attach_child() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 13/21] block: Mark bdrv_parent_perms_conflict() and callers GRAPH_RDLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 14/21] block: Mark bdrv_get_cumulative_perm() " Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 15/21] block: Mark bdrv_child_perm() GRAPH_RDLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 16/21] block: Mark bdrv_parent_cb_change_media() GRAPH_RDLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 17/21] block: Take graph rdlock in bdrv_drop_intermediate() Kevin Wolf
2023-09-12 16:49 ` Stefan Hajnoczi
2023-09-11 9:46 ` [PATCH v2 18/21] block: Take graph rdlock in bdrv_change_aio_context() Kevin Wolf
2023-09-12 16:49 ` Stefan Hajnoczi
2023-09-11 9:46 ` [PATCH v2 19/21] block: Mark bdrv_root_unref_child() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 20/21] block: Mark bdrv_unref_child() GRAPH_WRLOCK Kevin Wolf
2023-09-11 9:46 ` [PATCH v2 21/21] block: Mark bdrv_add/del_child() and caller GRAPH_WRLOCK Kevin Wolf
2023-09-12 16:49 ` [PATCH v2 00/21] Graph locking part 4 (node management) Stefan Hajnoczi
2023-09-14 13:12 ` Kevin Wolf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=94e17e07-d791-4cfe-970f-cdb9e7ae378b@virtuozzo.com \
--to=den@virtuozzo.com \
--cc=den@openvz.org \
--cc=eblake@redhat.com \
--cc=eesposit@redhat.com \
--cc=kwolf@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=vsementsov@yandex-team.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).