Re: [Qemu-devel] [PATCH v5 6/6] block: Enable qemu_open/close to work with fd sets

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Corey Bryant <coreyb@linux.vnet.ibm.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: aliguori@us.ibm.com, stefanha@linux.vnet.ibm.com,
	libvir-list@redhat.com, qemu-devel@nongnu.org,
	lcapitulino@redhat.com, eblake@redhat.com
Subject: Re: [Qemu-devel] [PATCH v5 6/6] block: Enable qemu_open/close to work with fd sets
Date: Tue, 24 Jul 2012 23:41:56 -0400	[thread overview]
Message-ID: <500F6B04.4020508@linux.vnet.ibm.com> (raw)
In-Reply-To: <500E901D.3080801@redhat.com>



On 07/24/2012 08:07 AM, Kevin Wolf wrote:
> Am 23.07.2012 15:08, schrieb Corey Bryant:
>> When qemu_open is passed a filename of the "/dev/fdset/nnn"
>> format (where nnn is the fdset ID), an fd with matching access
>> mode flags will be searched for within the specified monitor
>> fd set.  If the fd is found, a dup of the fd will be returned
>> from qemu_open.
>>
>> Each fd set has a reference count.  The purpose of the reference
>> count is to determine if an fd set contains file descriptors that
>> have open dup() references that have not yet been closed.  It is
>> incremented on qemu_open and decremented on qemu_close.  It is
>> not until the refcount is zero that file desriptors in an fd set
>> can be closed.  If an fd set has dup() references open, then we
>> must keep the other fds in the fd set open in case a reopen
>> of the file occurs that requires an fd with a different access
>> mode.
>>
>> Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
>>
>> v2:
>>   -Get rid of file_open and move dup code to qemu_open
>>    (kwolf@redhat.com)
>>   -Use strtol wrapper instead of atoi (kwolf@redhat.com)
>>
>> v3:
>>   -Add note about fd leakage (eblake@redhat.com)
>>
>> v4
>>   -Moved patch to be later in series (lcapitulino@redhat.com)
>>   -Update qemu_open to check access mode flags and set flags that
>>    can be set (eblake@redhat.com, kwolf@redhat.com)
>>
>> v5:
>>   -This patch was overhauled quite a bit in this version, with
>>    the addition of fd set and refcount support.
>>   -Use qemu_set_cloexec() on dup'd fd (eblake@redhat.com)
>>   -Modify flags set by fcntl on dup'd fd (eblake@redhat.com)
>>   -Reduce syscalls when setting flags for dup'd fd (eblake@redhat.com)
>>   -Fix O_RDWR, O_RDONLY, O_WRONLY checks (eblake@redhat.com)
>> ---
>>   block/raw-posix.c |   24 +++++-----
>>   block/raw-win32.c |    2 +-
>>   block/vmdk.c      |    4 +-
>>   block/vpc.c       |    2 +-
>>   block/vvfat.c     |   12 ++---
>>   cutils.c          |    5 ++
>>   monitor.c         |   85 +++++++++++++++++++++++++++++++++
>>   monitor.h         |    4 ++
>>   osdep.c           |  138 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   qemu-common.h     |    3 +-
>>   qemu-tool.c       |   12 +++++
>>   11 files changed, 267 insertions(+), 24 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index a172de3..5d0a801 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -271,7 +271,7 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
>>   out_free_buf:
>>       qemu_vfree(s->aligned_buf);
>>   out_close:
>> -    qemu_close(fd);
>> +    qemu_close(fd, filename);
>>       return -errno;
>>   }
>
> Hm, not a nice interface where qemu_close() needs the filename and
> (worse) could be given a wrong filename. Maybe it would be better to
> maintain a list of fd -> fdset mappings in qemu_open/close?
>

I agree, I don't really like it either.

We already have a list of fd -> fdset mappings (mon_fdset_fd_t -> 
mon_fdset_t).  Would it be too costly to loop through all the fdsets/fds 
at the beginning of every qemu_close()?

> But if we decided to keep it like this, please use the right interface
> from the beginning in patch 5 instead of updating it here.
>

Ok

>> @@ -2551,6 +2551,91 @@ static void monitor_fdsets_set_in_use(Monitor *mon, bool in_use)
>>       }
>>   }
>>
>> +void monitor_fdset_increment_refcount(Monitor *mon, int64_t fdset_id)
>> +{
>> +    mon_fdset_t *mon_fdset;
>> +
>> +    if (!mon) {
>> +        return;
>> +    }
>> +
>> +    QLIST_FOREACH(mon_fdset, &mon->fdsets, next) {
>> +        if (mon_fdset->id == fdset_id) {
>> +            mon_fdset->refcount++;
>> +            break;
>> +        }
>> +    }
>> +}
>> +
>> +void monitor_fdset_decrement_refcount(Monitor *mon, int64_t fdset_id)
>> +{
>> +    mon_fdset_t *mon_fdset;
>> +
>> +    if (!mon) {
>> +        return;
>> +    }
>> +
>> +    QLIST_FOREACH(mon_fdset, &mon->fdsets, next) {
>> +        if (mon_fdset->id == fdset_id) {
>> +            mon_fdset->refcount--;
>> +            if (mon_fdset->refcount == 0) {
>> +                monitor_fdset_cleanup(mon_fdset);
>> +            }
>> +            break;
>> +        }
>> +    }
>> +}
>
> These two functions are almost the same. Would a
> monitor_fdset_update_refcount(mon, fdset_id, value) make sense? These
> functions could then be kept as thin wrappers around it, or they could
> even be dropped completely.
>

This makes sense and I'll try one of these approaches.  I actually 
started to do something along these lines in v5 but reverted back to the 
two independent functions because it was easier to read the code.

>> +
>> +int monitor_fdset_get_fd(Monitor *mon, int64_t fdset_id, int flags)
>> +{
>> +    mon_fdset_t *mon_fdset;
>> +    mon_fdset_fd_t *mon_fdset_fd;
>> +    int mon_fd_flags;
>> +
>> +    if (!mon) {
>> +        errno = ENOENT;
>> +        return -1;
>> +    }
>> +
>> +    QLIST_FOREACH(mon_fdset, &mon->fdsets, next) {
>> +        if (mon_fdset->id != fdset_id) {
>> +            continue;
>> +        }
>> +        QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
>> +            if (mon_fdset_fd->removed) {
>> +                continue;
>> +            }
>> +
>> +            mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL);
>> +            if (mon_fd_flags == -1) {
>> +                return -1;
>> +            }
>> +
>> +            switch (flags & O_ACCMODE) {
>> +            case O_RDWR:
>> +                if ((mon_fd_flags & O_ACCMODE) == O_RDWR) {
>> +                    return mon_fdset_fd->fd;
>> +                }
>> +                break;
>> +            case O_RDONLY:
>> +                if ((mon_fd_flags & O_ACCMODE) == O_RDONLY) {
>> +                    return mon_fdset_fd->fd;
>> +                }
>> +                break;
>> +            case O_WRONLY:
>> +                if ((mon_fd_flags & O_ACCMODE) == O_WRONLY) {
>> +                    return mon_fdset_fd->fd;
>> +                }
>> +                break;
>> +            }
>
> I think you mean:
>
>    if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
>        return mon_fdset_fd->fd;
>    }

Yes, that would be a bit simpler wouldn't it. :)

>
> What about other flags that cannot be set with fcntl(), like O_SYNC on
> older kernels or possibly non-Linux? (The block layer doesn't use it any
> more, but I think we want to keep the function generally useful)
>

I see what you're getting at here.  Basically you could have 2 fds in an 
fdset with the same access mode flags, but one has O_SYNC on and the 
other has O_SYNC off.  That seems like it would make sense to implement. 
  As a work-around, I think a user could just create a separate fdset 
for the same file with different O_SYNC value.  But from a client 
perspective, it would be nicer to have this taken care of for you.

>> +        }
>> +        errno = EACCES;
>> +        return -1;
>> +    }
>> +    errno = ENOENT;
>> +    return -1;
>> +}
>> +
>>   /* mon_cmds and info_cmds would be sorted at runtime */
>>   static mon_cmd_t mon_cmds[] = {
>>   #include "hmp-commands.h"
>
>> @@ -75,6 +76,79 @@ int qemu_madvise(void *addr, size_t len, int advice)
>>   #endif
>>   }
>>
>> +/*
>> + * Dups an fd and sets the flags
>> + */
>> +static int qemu_dup(int fd, int flags)
>> +{
>> +    int i;
>> +    int ret;
>> +    int serrno;
>> +    int dup_flags;
>> +    int setfl_flags[] = { O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME,
>> +                          O_NONBLOCK, 0 };
>> +
>> +    if (flags & O_CLOEXEC) {
>> +        ret = fcntl(fd, F_DUPFD_CLOEXEC, 0);
>> +        if (ret == -1 && errno == EINVAL) {
>> +            ret = dup(fd);
>> +            if (ret != -1 && fcntl_setfl(ret, O_CLOEXEC) == -1) {
>> +                goto fail;
>> +            }
>> +        }
>> +    } else {
>> +        ret = dup(fd);
>> +    }
>> +
>> +    if (ret == -1) {
>> +        goto fail;
>> +    }
>> +
>> +    dup_flags = fcntl(ret, F_GETFL);
>> +    if (dup_flags == -1) {
>> +        goto fail;
>> +    }
>> +
>> +    if ((flags & O_SYNC) != (dup_flags & O_SYNC)) {
>> +        errno = EINVAL;
>> +        goto fail;
>> +    }
>
> It's worth trying to set it before failing, newer kernels can do it. But
> as I said above, if you can fail here, it makes sense to consider O_SYNC
> when selecting the right file descriptor from the fdset.
>

I'm on a 3.4.4 Fedora kernel that doesn't appear to support 
fcntl(O_SYNC), but perhaps I'm doing something wrong.  Here's my test 
code (shortened for simplicty):

int main() {
     int fd;
     int flags;

     fd = open("/tmp/corey", O_RDWR | O_CREAT,
               S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

     flags = fcntl(fd, F_GETFL);
     printf("flags=%08x\n", flags); //A

     fcntl(fd, F_SETFL, O_SYNC);
     printf("O_SYNC=%08x\n", O_SYNC);

     flags = fcntl(fd, F_GETFL);
     printf("flags=%08x\n", flags); //B
}

fcntl doesn't fail, but the flags I print out at A are the same as the 
flags printed out at B, so it appears that O_SYNC doesn't get set.

>> +    /* Set/unset flags that we can with fcntl */
>> +    i = 0;
>> +    while (setfl_flags[i] != 0) {
>> +        if (flags & setfl_flags[i]) {
>> +            dup_flags = (dup_flags | setfl_flags[i]);
>> +        } else {
>> +            dup_flags = (dup_flags & ~setfl_flags[i]);
>> +        }
>> +        i++;
>> +    }
>
> What about this instead of the loop:
>
>    int setfl_flags = O_APPEND | O_ASYNC | ... ;
>
>    dup_flags &= ~setfl_flags;
>    dup_flags |= (flags & setfl_flags);
>
>

I like your suggestion, it's much simpler.

>> +
>> +    if (fcntl(ret, F_SETFL, dup_flags) == -1) {
>> +        goto fail;
>> +    }
>> +
>> +    /* Truncate the file in the cases that open() would truncate it */
>> +    if (flags & O_TRUNC ||
>> +            ((flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))) {
>> +        if (ftruncate(ret, 0) == -1) {
>> +            goto fail;
>> +        }
>> +    }
>
> O_CREAT | O_EXCL kind of loses its meaning here, but okay, it's hard to
> do better with file descriptors.
>

I agree and I don't know if we can do any better.

>> +
>> +    qemu_set_cloexec(ret);
>
> Wait... If O_CLOEXEC is set, you set the flag immediately and if it
> isn't you set it at the end of the function? What's the intended use
> case for not using O_CLOEXEC then?
>

This is a mistake.  I think I just need to be using qemu_set_cloexec() 
instead of fcntl_setfl() earlier in this function and get rid of this 
latter call to qemu_set_cloexec().

-- 
Regards,
Corey

next prev parent reply	other threads:[~2012-07-25  3:42 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-23 13:07 [Qemu-devel] [PATCH v5 0/6] file descriptor passing using fd sets Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 1/6] qemu-char: Add MSG_CMSG_CLOEXEC flag to recvmsg Corey Bryant
2012-07-23 22:50   ` Eric Blake
2012-07-24  2:19     ` Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 2/6] qapi: Introduce add-fd, remove-fd, query-fdsets Corey Bryant
2012-07-25 18:16   ` Eric Blake
2012-07-26  2:55     ` Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 3/6] monitor: Clean up fd sets on monitor disconnect Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 4/6] block: Convert open calls to qemu_open Corey Bryant
2012-07-25 19:22   ` Eric Blake
2012-07-26  3:11     ` Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 5/6] block: Convert close calls to qemu_close Corey Bryant
2012-07-23 13:08 ` [Qemu-devel] [PATCH v5 6/6] block: Enable qemu_open/close to work with fd sets Corey Bryant
2012-07-23 13:14   ` Corey Bryant
2012-08-02 22:21     ` Corey Bryant
2012-08-06  9:15       ` Kevin Wolf
2012-08-06 13:32         ` Corey Bryant
2012-08-06 13:51           ` Kevin Wolf
2012-08-06 14:15             ` Corey Bryant
2012-08-07 16:43               ` Corey Bryant
2012-07-24 12:07   ` Kevin Wolf
2012-07-25  3:41     ` Corey Bryant [this message]
2012-07-25  8:22       ` Kevin Wolf
2012-07-25 19:25         ` Eric Blake
2012-07-26  3:21           ` Corey Bryant
2012-07-26 13:13             ` Eric Blake
2012-07-26 13:16               ` Kevin Wolf
2012-07-27  4:07                 ` Corey Bryant
2012-07-25 19:43   ` Eric Blake
2012-07-26  3:57     ` Corey Bryant
2012-07-26  9:07       ` Kevin Wolf
2012-07-27  3:59         ` Corey Bryant
2012-07-27  4:03         ` Corey Bryant
2012-08-02 15:08       ` Corey Bryant
2012-07-24 12:09 ` [Qemu-devel] [PATCH v5 0/6] file descriptor passing using " Kevin Wolf
2012-07-25  3:42   ` Corey Bryant

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=500F6B04.4020508@linux.vnet.ibm.com \
    --to=coreyb@linux.vnet.ibm.com \
    --cc=aliguori@us.ibm.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=lcapitulino@redhat.com \
    --cc=libvir-list@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).