From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:51614)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <coreyb@linux.vnet.ibm.com>) id 1Sm5Ld-0004by-1n
	for qemu-devel@nongnu.org; Tue, 03 Jul 2012 11:49:11 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <coreyb@linux.vnet.ibm.com>) id 1Sm5LV-0004vi-PP
	for qemu-devel@nongnu.org; Tue, 03 Jul 2012 11:49:04 -0400
Received: from e38.co.us.ibm.com ([32.97.110.159]:42088)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <coreyb@linux.vnet.ibm.com>) id 1Sm5LV-0004tz-A6
	for qemu-devel@nongnu.org; Tue, 03 Jul 2012 11:48:57 -0400
Received: from /spool/local
	by e38.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <coreyb@linux.vnet.ibm.com>;
	Tue, 3 Jul 2012 09:48:50 -0600
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com
	[9.17.195.106])
	by d03dlp02.boulder.ibm.com (Postfix) with ESMTP id DAE113E40181
	for <qemu-devel@nongnu.org>; Tue,  3 Jul 2012 15:41:28 +0000 (WET)
Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169])
	by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	q63Fesbg136692
	for <qemu-devel@nongnu.org>; Tue, 3 Jul 2012 09:41:12 -0600
Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1])
	by d03av03.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
	id q63FeOpO029925
	for <qemu-devel@nongnu.org>; Tue, 3 Jul 2012 09:40:24 -0600
Message-ID: <4FF31265.1000308@linux.vnet.ibm.com>
Date: Tue, 03 Jul 2012 11:40:21 -0400
From: Corey Bryant <coreyb@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <1340390174-7493-1-git-send-email-coreyb@linux.vnet.ibm.com>
	<20120626091004.GA14451@redhat.com> <4FE9A0F0.2050809@redhat.com>
	<20120626175045.2c7011b3@doriath.home>	<4FEA37A9.10707@linux.vnet.ibm.com>
	<4FEA3D9C.8080205@redhat.com> <4FF21A67.8010100@linux.vnet.ibm.com>
In-Reply-To: <4FF21A67.8010100@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v4 0/7] file descriptor
	passing	using	pass-fd
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>
Cc: aliguori@us.ibm.com, stefanha@linux.vnet.ibm.com, libvir-list@redhat.com, qemu-devel@nongnu.org, Luiz Capitulino <lcapitulino@redhat.com>, pbonzini@redhat.com


On 07/02/2012 06:02 PM, Corey Bryant wrote:
>
>
> On 06/26/2012 06:54 PM, Eric Blake wrote:
>> On 06/26/2012 04:28 PM, Corey Bryant wrote:
>>
>>>>>> With this proposed series, we have usage akin to:
>>>>>>
>>>>>>     1. pass_fd FDSET={M} -> returns a string "/dev/fd/N" showing
>>>>>> QEMU's
>>>>>>        view of the FD
>>>>>>     2. drive_add file=/dev/fd/N
>>>>>>     3. if failure:
>>>>>>          close_fd "/dev/fd/N"
>>>>>
>>>>> In fact, there are more steps:
>>>>>
>>>>> 4. use it successfully
>>>>> 5. close_fd "/dev/fd/N"
>>>>>
>>>>> I think it would well be possible that qemu just closes the fd when
>>>>> it's
>>>>> not used internally any more.
>>>>
>>>> pass-fd could have a flag indicating qemu to do that.
>>>>
>>>
>>> It seems like libvirt would be in a better position to understand when a
>>> file is no longer in use, and then it can call close_fd.  No?  Of course
>>> the the only fd that needs to be closed is the originally passed fd. The
>>> dup'd fd's are closed by QEMU.
>>
>> The more we discuss this, the more I'm convinced that commands like
>> 'block-commit' that will reopen a file to change from O_RDONLY to O_RDWR
>> will need to have an optional argument that says the name of the file to
>> reopen.  That is, I've seen three proposals:
>>
>> Proposal One: enhance every command to take an fd:name protocol.
>> PRO: Successful use of a name removes the fd from the 'getfd' list.
>> CON: Lots of commands to change.
>> CON: The command sequence is not atomic.
>> CON: Block layer does not have a file name tied to the fd, so the reopen
>> operation also needs to learn the fd:name protocol, but since we're
>> already touching the command it's not much more work.
>> USAGE:
>> 1. getfd FDSET={M}, ties new fd to "name"
>> 2. drive_add fd=name - looks up the fd by name from the list
>> 3a. on success, fd is no longer in the list, drive_add consumed it
>> 3b. on failure, libvirt must call 'closefd name' to avoid a leak
>> 4. getfd FDSET={M2}, ties new fd to "newname"
>> 5. block-commit fd=newname - looks up fd by name from the list
>> 6a. on success, fd is no longer in the list, block-commit consumed it
>> 6b. on failure, libvirt must call 'closefd newname' to avoid a leak
>>
>> Proposal Two: Create a permanent name via 'getfd' or 'pass-fd', then
>> update that name to the new fd in advance of any operation that needs to
>> do a reopen.
>> PRO: All other commands work without impact by using qemu_open(), by
>> getting a clone of the permanent name.
>> CON: The permanent name must remain open as long as qemu might need to
>> reopen it, and reopening needs the pass-fd force option.
>> CON: The command sequence is not atomic.
>> USAGE:
>> 1. pass_fd FDSET={M} -> returns an integer N (or a string "/dev/fd/N")
>> showing QEMU's permanent name of the FD
>> 2. drive_add file=<permanent name> means that qemu_open() will clone the
>> fd, and drive_add is now using yet another FD while N remains open;
>> meanwhile, the block layer knows that the drive is tied to the permanent
>> name
>> 3. pass-fd force FDSET={M2} -> replaces fd N with the passed M2, and
>> still returns /dev/fd/N
>> 4. block-commit - looks up the block-device name (/dev/fd/N), which maps
>> back to the permanent fd, and gets a copy of the newly passed M2
>> 5. on completion (success or failure), libvirt must call 'closefd name'
>> to avoid a leak
>>
>> Proposal Three: Use thread-local fds passed alongside any command,
>> rather than passing via a separate command
>> PRO: All commands can now atomically handle fd passing
>> PRO: Commands that only need a one-shot open can now use fd
>> CON: Commands that need to reopen still need modification to allow a
>> name override during the reopen
>> 1. drive_add nfds=1 file="/dev/fdlist/1" FDSET={M} -> on success, the fd
>> is used as the block file, on failure it is atomically closed, so there
>> is no leak either way. However, the block file has no permanent name.
>> 2. block-commit nfds=1 file-override="/dev/fdlist/1" FDSET={M2} -> file
>> override must be passed again (since we have no guarantee that the
>> /dev/fdlist/1 of _this_ command matches the command-local naming used in
>> the previous command), but the fd is again used atomically
>>
>> Under proposal 3, there is no need to dup fd's, merely only to check
>> that qemu_open("/dev/fdlist/n", flags) has compatible flags with the fd
>> received via FDSET.  And the fact that things are made atomic means that
>> libvirt no longer has to worry about calling closefd, nor does it have
>> to worry about being interrupted between two monitor commands and not
>> knowing what state qemu is in.  But it does mean teaching every possible
>> command that wants to do a reopen to learn how to use fd overrides
>> rather than to reuse the permanent name that was originally in place on
>> the first open.
>>
>
> Here's another option that Kevin and I discussed today on IRC.  I've
> modified a few minor details since the discussion. And Kevin please
> correct me if anything is wrong.
>
> Proposal Four: Pass a set of fds via 'pass-fds'.  The group of fds
> should all refer to the same file, but may have different access flags
> (ie. O_RDWR, O_RDONLY).  qemu_open can then dup the fd that has the
> matching access mode flags.
> pass-fds:
>      { 'command': 'pass-fds',
>        'data': {'fdsetname': 'str', '*close': 'bool'},
>        'returns': 'string' }
> close-fds:
>      { 'command': 'close-fds',
>        'data': {'fdsetname': 'str' }
> where:
>       @fdsetname - the file descriptor set name
>       @close - *if true QEMU closes the monitor fds when they expire.
>                if false, fds remain on the monitor until close-fds
>                command.
> PRO: Supports reopen
> PRO: All other commands work without impact by using qemu_open()
> PRO: No fd leakage if close==true specified
> CON: Determining when to close fds when close==true may be tricky
> USAGE:
> 1. pass-fd FDSET={M} -> returns a string "/dev/fdset/1") that refers to
> the passed set of fds.
> 2. drive_add file=/dev/fdset/1 -- qemu_open() uses the first fd from the
> list that has access flags matching the qemu_open() action flags.
> 3. block-commit -- qemu_open() reopens "/dev/fdset/1" by using the first
> fd from the list that has access flags matching the qemu_open() action
> flags.
> 4. The monitor fds are closed:
>     A) *If @close == true, qemu closes the set of fds when the timer
>        expires
>     B) If @close == false, libvirt must issue close-fds command to close
>        the set of fds
>
> *How to solve this has yet to be determined.  Kevin mentioned
> potentially using a bottom-half or a timer in qemu_close().
>

Thanks again for taking time to discuss this at today's QEMU community call.

Here's the proposal we discussed at the call.  Please let me know if I 
missed anything or if there are any issues with this design.

Proposal Five:  New monitor commands enable adding/removing an fd 
to/from a set.  The set of fds should all refer to the same file, but 
may have different access flags (ie. O_RDWR, O_RDONLY).  qemu_open can 
then dup the fd that has the matching access mode flags.
PRO: Supports reopen
PRO: All other commands work without impact by using qemu_open()
PRO: No fd leakage (fds are associated with monitor connection and, if 
not in use, closed when monitor disconnects)
PRO: Security-wise this is ok since libvirt can manage the set of fd's 
(ie. it can add/remove an O_RDWR fd to/from the set as needed).
CON: Not atomic (e.g. doesn't add an fd with single drive_add command).
USAGE:
1. add-fd /dev/fdset/1 FDSET={M} -> qemu adds fd to set named 
"/dev/fdset/1" - command returns qemu fd (e.g fd=4) to caller.  libvirt 
in-use flag turned on for fd.
2. drive_add file=/dev/fdset/1 -> qemu_open uses the first fd from the
set that has access flags matching the qemu_open action flags. 
qemu_open increments refcount for this fd.
3. add-fd /dev/fdset/1 FDSET={M} -> qemu adds fd to set named 
"/dev/fdset/1" - command returns qemu fd to caller (e.g fd=5).  libvirt 
in-use flag turned on for fd.
3. block-commit -> qemu_open reopens "/dev/fdset/1" by using the first
fd from the set that has access flags matching the qemu_open action 
flags.  qemu_open increments refcount for this fd.
4. remove-fd /dev/fdset/1 5 -> caller requests fd==5 be removed from the 
set.  turns libvirt in-use flag off marking the fd ready to be closed 
when qemu is done with it.
5. qemu_close decrements refcount for fd, and closes fd when refcount is 
zero and libvirt in use flag is off.

More functional details:
-If libvirt crashes it can call "query-fd /dev/fdset/1" to determine 
which fds are open in the set.
-If monitor connection closes, qemu will close fds that have a refcount 
of zero.  Do we also need a qemu in-use flag in case refcount is zero 
and fd is still in use?
-This support requires introduction of qemu_close function that will be 
called everywhere in block layer that close is currently called.

Notes:
-Patch series 1 will include support for all of the above.  This will be 
my initial focus.
-Patch series 2 will include command line support that enables 
association of command line fd with a monitor set.  This will be my 
secondary focus, most likely after patch series 1 is applied.


-- 
Regards,
Corey