From: Claudio Fontana <cfontana@suse.de>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Nikolay Borisov <nborisov@suse.com>,
berrange@redhat.com, qemu-devel@nongnu.org,
Claudio Fontana <Claudio.Fontana@suse.com>,
Jim Fehlig <jfehlig@suse.com>,
quintela@redhat.com
Subject: Re: towards a workable O_DIRECT outmigration to a file
Date: Thu, 18 Aug 2022 20:45:42 +0200
Message-ID: <55725d92-42f9-a960-0117-e9ba924bf6e5@suse.de>
In-Reply-To: <4c984c87-d8c4-0af5-0619-9509a23f916c@suse.de>
On 8/18/22 20:09, Claudio Fontana wrote:
> On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
>> * Claudio Fontana (cfontana@suse.de) wrote:
>>> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
>>>> * Nikolay Borisov (nborisov@suse.com) wrote:
>>>>> [adding Juan and David to cc as I had missed them. ]
>>>>
>>>> Hi Nikolay,
>>>>
>>>>> On 11.08.22 at 16:47, Nikolay Borisov wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm currently looking into implementing a 'file:' uri for migration save
>>>>>> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
>>>>>> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
>>>>>> process of brainstorming what a solution would look like, a couple of
>>>>>> questions came up that I think warrant wider discussion in the
>>>>>> community.
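
Just to make that constraint concrete, here is a minimal sketch of what "O_DIRECT compatible" implies for whoever does the writes (illustration only, not code from the mig-file branch): the buffer address, the transfer length and the file offset all have to be aligned, so the writer has to pad and use an aligned bounce buffer. The 4k alignment and the helper name are assumptions for the example; the fd is assumed to have been opened with O_WRONLY | O_CREAT | O_DIRECT.

#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096           /* assumption: 4k covers common block sizes */

/* Write one chunk with O_DIRECT: buffer, length and offset all aligned. */
int write_chunk_odirect(int fd, const void *data, size_t len, off_t offset)
{
    size_t padded = (len + ALIGN - 1) & ~(size_t)(ALIGN - 1);
    void *buf;

    if (posix_memalign(&buf, ALIGN, padded)) {
        return -1;
    }
    memset(buf, 0, padded);
    memcpy(buf, data, len);

    ssize_t ret = pwrite(fd, buf, padded, offset);  /* offset aligned too */

    free(buf);
    return ret == (ssize_t)padded ? 0 : -1;
}
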
>>>>
>>>> OK, so this seems to be a continuation with Claudio and Daniel and co as
>>>> of a few months back. I'd definitely be leaving libvirt sides of the
>>>> question here to Dan, and so that also means definitely looking at that
>>>> tree above.
>>>
>>> Hi Dave, yes, Nikolay is trying to continue on the qemu side.
>>>
>>> We have something working with libvirt for our short term needs which offers good performance,
>>> but it is clear that this simple solution is barred from merging into upstream libvirt.
>>>
>>>
>>>>
>>>>>> First, implementing a solution which is self-contained within qemu would
>>>>>> be easy enough (famous last words), but the gist is one only has to care
>>>>>> about the format within qemu. However, I'm being told that what libvirt
>>>>>> does is prepend its own custom header to the resulting saved file, then
>>>>>> slipstream the migration stream from qemu. Now, with the solution that I
>>>>>> envision I intend to keep all write-related logic inside qemu, which
>>>>>> means there's no way to incorporate libvirt's logic. The reason I'd
>>>>>> like to keep the write process within qemu is to avoid an extra copy of
>>>>>> data between the two processes (qemu's outgoing migration and libvirt):
>>>>>> with the current fd approach qemu is passed an fd, data is copied
>>>>>> between qemu/libvirt and finally the libvirt_iohelper writes the data.
>>>>>> So the question which remains to be answered is how would libvirt make
>>>>>> use of this new functionality in qemu? I was thinking something along
>>>>>> the lines of:
>>>>>>
>>>>>> 1. Qemu writes its migration stream to a file, ideally on a filesystem
>>>>>> which supports reflink - xfs/btrfs
>>>>>>
>>>>>> 2. Libvirt writes its header to a separate file
>>>>>> 2.1 Reflinks qemu's stream right after its header (see the reflink sketch below)
>>>>>> 2.2 Writes its trailer
>>>>>>
>>>>>> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
>>>>>>
>>>>>> I wouldn't call this solution hacky, though it definitely leaves some
>>>>>> bitter aftertaste.
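
For reference, a rough sketch of what step 2.1 could look like using the FICLONERANGE ioctl (illustration only, not taken from any existing patches; it assumes a reflink-capable filesystem such as xfs/btrfs and a libvirt header padded to a filesystem-block-aligned size, since clone offsets must be block-aligned):

#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/fs.h>            /* FICLONERANGE, struct file_clone_range */

/* Clone qemu's stream file into libvirt's file, right after the header. */
int reflink_stream_after_header(int libvirt_fd, int qemu_fd,
                                off_t header_len, off_t stream_len)
{
    struct file_clone_range fcr = {
        .src_fd      = qemu_fd,      /* qemu's migration stream file */
        .src_offset  = 0,
        .src_length  = stream_len,   /* 0 would also mean "to EOF"   */
        .dest_offset = header_len,   /* must be block-aligned        */
    };

    /* Shares extents instead of copying the data. */
    if (ioctl(libvirt_fd, FICLONERANGE, &fcr) < 0) {
        perror("FICLONERANGE");
        return -1;
    }
    return 0;
}

After this, step 3 (unlinking qemu's file) leaves only libvirt's file on disk, with the data blocks shared rather than copied.
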
>>>>
>>>> Wouldn't it be simpler to tell libvirt to write it's header, then tell
>>>> qemu to append everything?
>>>
>>> I would think so as well.
>>>
>>>>
>>>>>> Another solution would be to extend the 'fd:' protocol to allow multiple
>>>>>> descriptors (for multifd support) to be passed in. The reason dup()
>>>>>> can't be used is that supporting multifd requires being able to write to
>>>>>> multiple, non-overlapping regions of the file, and duplicated fds share
>>>>>> their offset (they refer to the same open file description). But that
>>>>>> really seems more or less hacky. Alternatively, pwrite() could be used
>>>>>> to write to non-overlapping regions of the file. Any feedback is
>>>>>> welcome.
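
A small illustration of the dup() problem and of how pwrite() sidesteps it (the duplicated descriptors share one open file description and therefore one offset, while pwrite() takes an explicit offset and never touches the shared one); the file name and offsets here are made up:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd  = open("out.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
    int fd2 = dup(fd);                 /* same open file description   */

    write(fd,  "aaaa", 4);             /* moves the shared offset to 4 */
    write(fd2, "bbbb", 4);             /* continues at 4, not at 0     */

    /* pwrite() uses an explicit offset and leaves the shared offset
     * alone, so independent channels cannot step on each other:       */
    pwrite(fd,  "AAAA", 4, 0);         /* channel 0's region */
    pwrite(fd2, "BBBB", 4, 4096);      /* channel 1's region */

    close(fd2);
    close(fd);
    return 0;
}
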
>>>>
>>>> I do like the idea of letting fd: take multiple fd's.
>>>
>>> Fine in my view, but I think we will then still need a helper process in libvirt to merge the data into a single file, no?
>>> In case the multithreaded libvirt helper I proposed before (multifd to a single file) is helpful as a reference, you could reuse/modify those patches.
>>
>> Eww that's messy isn't it.
>> (You don't fancy a huge sparse file do you?)
>
> Wait am I missing something obvious here?
>
> Maybe we don't need any libvirt extra process.
>
> why don't we open the _single_ file multiple times from libvirt?
>
> Let's say the "main channel" fd is opened and we write the libvirt header,
> then we reopen the same file multiple times,
> and finally pass all fds to qemu, one fd for each parallel transfer channel we want to use
> (so we solve all the permissions, security labeling issues etc).
>
> Then, from QEMU, we can write to those fds at the right offsets for each separate channel,
> which is easy from QEMU because we know exactly how much data we need to transfer before starting the migration,
> so we have even less need for "holes", possibly only minor ones for single-byte adjustments
> for uneven division of the interleaved file.
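
As a rough sketch of the libvirt side of that idea (illustration only, with made-up names and channel count; the real code would additionally have to hand the fds to qemu, e.g. over the monitor socket):

#include <fcntl.h>
#include <unistd.h>

#define NCHANNELS 4          /* assumption: number of parallel channels */

/* Open the save file once per channel; every open() gets its own file
 * description and therefore its own independent offset. */
int open_channels(const char *path, const void *header, size_t hdr_len,
                  int fds[NCHANNELS])
{
    /* "main channel": create the file and write the libvirt header */
    fds[0] = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fds[0] < 0 || write(fds[0], header, hdr_len) != (ssize_t)hdr_len) {
        return -1;
    }

    /* one extra open() of the same path per additional channel */
    for (int i = 1; i < NCHANNELS; i++) {
        fds[i] = open(path, O_WRONLY);
        if (fds[i] < 0) {
            return -1;
        }
    }

    /* all fds then get handed to qemu, one per transfer channel */
    return 0;
}
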
Or even better: don't pass multiple fds, just _one_ fd,
and then from qemu write using multiple threads and pread()/pwrite(), so we don't have the additional complication of managing a bunch of fds.
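
On the qemu side that single-fd variant could look roughly like the sketch below (pure illustration with hypothetical types, assuming each channel's total size, and hence its region, is known before the migration starts):

#include <pthread.h>
#include <unistd.h>

struct channel {
    int    fd;       /* the single shared fd                 */
    off_t  base;     /* start of this channel's file region  */
    void  *buf;      /* data this channel has to write       */
    size_t len;
};

/* Each worker pwrite()s only inside its own pre-computed region. */
static void *channel_worker(void *opaque)
{
    struct channel *c = opaque;
    size_t done = 0;

    while (done < c->len) {
        ssize_t n = pwrite(c->fd, (char *)c->buf + done,
                           c->len - done, c->base + done);
        if (n <= 0) {
            break;                    /* real code: handle errors/EINTR */
        }
        done += n;
    }
    return NULL;
}

int write_channels(struct channel *chans, int nchannels)
{
    pthread_t tids[nchannels];

    for (int i = 0; i < nchannels; i++) {
        pthread_create(&tids[i], NULL, channel_worker, &chans[i]);
    }
    for (int i = 0; i < nchannels; i++) {
        pthread_join(tids[i], NULL);
    }
    return 0;
}
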
Ciao,
Claudio
>
> What is wrong with this one, or does anyone see some other better approach?
>
> Thanks,
>
> C
>
>>
>>> Maybe this new way will be acceptable to libvirt,
>>> ie avoiding the multifd code -> socket, but still merging the data from the multiple fds into a single file?
>>
>> It feels to me like what we really want here is something
>> closer to a dump than the migration code; you don't need all the
>> overhead of the code that deals with live migration bitmaps and dirty pages
>> that aren't going to happen.
>> Something that just does a nice single write(2) (for each memory
>> region);
>> and then ties the device state on.
>>
>> Dave
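
If I read Dave's suggestion right, the shape would be roughly the sketch below (purely illustrative, with placeholder types, not existing qemu code): iterate the memory regions once, write each one with a single large pwrite() at a known offset, then append the device state.

#include <unistd.h>

struct ram_block {            /* placeholder, not qemu's real RAMBlock */
    void   *host_addr;
    size_t  size;
};

/* Write every memory region once with one big pwrite(), no dirty
 * tracking, then tie the device state on at the end of the file. */
int dump_ram_and_devices(int fd,
                         struct ram_block *blocks, int nblocks,
                         const void *devstate, size_t devstate_len)
{
    off_t off = 0;

    for (int i = 0; i < nblocks; i++) {
        if (pwrite(fd, blocks[i].host_addr, blocks[i].size, off) < 0) {
            return -1;        /* real code: handle short writes too */
        }
        off += blocks[i].size;
    }

    return pwrite(fd, devstate, devstate_len, off) < 0 ? -1 : 0;
}
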
>>
>>>>
>>>> Dave
>>>>
>>>
>>> Thanks for your comments,
>>>
>>> Claudio
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Nikolay
>>>>>
>>>
>
Thread overview: 14+ messages
2022-08-11 13:47 towards a workable O_DIRECT outmigration to a file Nikolay Borisov
2022-08-11 14:10 ` Nikolay Borisov
2022-08-18 12:38 ` Dr. David Alan Gilbert
2022-08-18 12:52 ` Claudio Fontana
2022-08-18 16:31 ` Dr. David Alan Gilbert
2022-08-18 18:09 ` Claudio Fontana
2022-08-18 18:45 ` Claudio Fontana [this message]
2022-08-18 18:49 ` Dr. David Alan Gilbert
2022-08-18 19:14 ` Claudio Fontana
2022-08-18 18:13 ` Claudio Fontana
2022-09-08 10:26 ` [PATCH] migration: support file: uri for source migration Nikolay Borisov
2022-09-12 15:41 ` Daniel P. Berrangé
2022-09-12 16:30 ` Nikolay Borisov
2022-09-12 16:43 ` Daniel P. Berrangé