* [Qemu-devel] Safely reopening image files by stashing fds @ 2011-08-05 8:40 Stefan Hajnoczi 2011-08-05 9:04 ` Paolo Bonzini ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-05 8:40 UTC (permalink / raw) To: Supriya Kannery; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel We've discussed safe methods for reopening image files (e.g. useful for changing the hostcache parameter). The problem is that closing the file first and then opening it again exposes us to the error case where the open fails. At that point we cannot get to the file anymore and our options are to terminate QEMU, pause the VM, or offline the block device. This window of vulnerability can be eliminated by keeping the file descriptor around and falling back to it should the open fail. The challenge for the file descriptor approach is that image formats, like VMDK, can span multiple files. Therefore the solution is not as simple as stashing a single file descriptor and reopening from it. Here is the outline for an fd stashing mechanism that can handle reopening multi-file images and could also solve the file descriptor passing problem for libvirt: 1. Extract monitor getfd/closefd functionality The monitor already supports fd stashing with getfd/closefd commands. But the fd stash code is part of Monitor and we need to extract it into its own object. /* A stashed file descriptor */ typedef FDEntry { const char *name; int fd; QLIST_ENTRY(FDEntry) next; } FDEntry; /* A container for stashing file descriptors */ typedef struct FDStash { QLIST_HEAD(, FDEntry) fds; } FDStash; void fdstash_init(FDStash *stash); /** * Clear stashed file descriptors and close them */ void fdstash_cleanup(FDStash *stash); /** * Stash a file descriptor and give up ownership * * If a file descriptor is already present with the same name the old fd is * closed and replaced by the new one. */ void fdstash_give(FDStash *stash, const char *name, int fd); /** * Find and take ownership of a stashed file descriptor * * Return the file descriptor or -ENOENT if not found. */ int fdstash_take(FDStash *stash, const char *name); The monitor is refactored to use this code instead of open coding fd stashing. 2. Introduce a function to extract open file descriptors from an block device Add a new .bdrv_extract_fds(BlockDriverState *bs, FDStash *stash) interface, which defaults to calling bdrv_extract_fds(bs->file, stash). VMDK and protocols can implement this function to support extracting open fds from a block device. Note that they need to dup(2) fds before giving them to the fdstash, otherwise the fd will be closed when the block device is closed/deleted. 3. Rework bdrv_open() to take a FDStash Check the FDStash before opening an image file on the host file system. This makes it possible to open an image file and use existing stashed fds. 4. Implement bdrv_reopen() First call bdrv_extract_fds() to stash the file descriptors, then close the block device. Try opening the new image but if that fails, reopen using the stashed file descriptors. Thoughts? Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi @ 2011-08-05 9:04 ` Paolo Bonzini 2011-08-05 9:27 ` Stefan Hajnoczi 2011-08-05 9:07 ` Kevin Wolf 2011-08-05 20:16 ` Blue Swirl 2 siblings, 1 reply; 39+ messages in thread From: Paolo Bonzini @ 2011-08-05 9:04 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote: > 4. Implement bdrv_reopen() > > First call bdrv_extract_fds() to stash the file descriptors, then close the > block device. Try opening the new image but if that fails, reopen using the > stashed file descriptors. Why not do the latter unconditionally? Paolo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:04 ` Paolo Bonzini @ 2011-08-05 9:27 ` Stefan Hajnoczi 2011-08-05 9:55 ` Paolo Bonzini ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-05 9:27 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel On Fri, Aug 5, 2011 at 10:04 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote: >> >> 4. Implement bdrv_reopen() >> >> First call bdrv_extract_fds() to stash the file descriptors, then close >> the >> block device. Try opening the new image but if that fails, reopen using >> the >> stashed file descriptors. > > Why not do the latter unconditionally? Because you cannot change O_DIRECT on an open fd :(. This is why we're going through this pain. The only method I've found that works is to open("/proc/self/fd/X", new_flags) but that's non-portable. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:27 ` Stefan Hajnoczi @ 2011-08-05 9:55 ` Paolo Bonzini 2011-08-05 13:03 ` Stefan Hajnoczi 2011-08-05 13:12 ` Daniel P. Berrange 2011-08-05 14:27 ` Christoph Hellwig 2 siblings, 1 reply; 39+ messages in thread From: Paolo Bonzini @ 2011-08-05 9:55 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel On 08/05/2011 11:27 AM, Stefan Hajnoczi wrote: >>> First call bdrv_extract_fds() to stash the file descriptors, then close >>> the block device. Try opening the new image but if that fails, reopen using >>> the stashed file descriptors. >> >> Why not do the latter unconditionally? > > Because you cannot change O_DIRECT on an open fd:(. This is why > we're going through this pain. > > The only method I've found that works is to open("/proc/self/fd/X", > new_flags) but that's non-portable. Maybe I'm missing something obvious, but so is O_DIRECT, no? :) So for Linux you can dup the stashed file descriptors using /proc/self/fd and change flags directly, and for other OSes you can dup them using dup2 and change flags with F_SETFL. In any case, reopening can always be done using the stashed descriptors (or BlockDriverStates). Paolo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:55 ` Paolo Bonzini @ 2011-08-05 13:03 ` Stefan Hajnoczi 0 siblings, 0 replies; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-05 13:03 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel On Fri, Aug 5, 2011 at 10:55 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 08/05/2011 11:27 AM, Stefan Hajnoczi wrote: >>>> >>>> First call bdrv_extract_fds() to stash the file descriptors, then close >>>> the block device. Try opening the new image but if that fails, reopen >>>> using >>>> the stashed file descriptors. >>> >>> Why not do the latter unconditionally? >> >> Because you cannot change O_DIRECT on an open fd:(. This is why >> we're going through this pain. >> >> The only method I've found that works is to open("/proc/self/fd/X", >> new_flags) but that's non-portable. > > Maybe I'm missing something obvious, but so is O_DIRECT, no? :) http://www.freebsd.org/cgi/man.cgi?query=open&apropos=0&sektion=0&manpath=FreeBSD+8.1-RELEASE&format=html > So for Linux you can dup the stashed file descriptors using /proc/self/fd > and change flags directly, and for other OSes you can dup them using dup2 > and change flags with F_SETFL. In any case, reopening can always be done > using the stashed descriptors (or BlockDriverStates). F_SETFL doesn't let you change arbitrary flags. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:27 ` Stefan Hajnoczi 2011-08-05 9:55 ` Paolo Bonzini @ 2011-08-05 13:12 ` Daniel P. Berrange 2011-08-05 14:28 ` Christoph Hellwig 2011-08-05 14:27 ` Christoph Hellwig 2 siblings, 1 reply; 39+ messages in thread From: Daniel P. Berrange @ 2011-08-05 13:12 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Kevin Wolf, Paolo Bonzini, Anthony Liguori, qemu-devel, Supriya Kannery On Fri, Aug 05, 2011 at 10:27:00AM +0100, Stefan Hajnoczi wrote: > On Fri, Aug 5, 2011 at 10:04 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote: > >> > >> 4. Implement bdrv_reopen() > >> > >> First call bdrv_extract_fds() to stash the file descriptors, then close > >> the > >> block device. Try opening the new image but if that fails, reopen using > >> the > >> stashed file descriptors. > > > > Why not do the latter unconditionally? > > Because you cannot change O_DIRECT on an open fd :(. This is why > we're going through this pain. Hmm, I remember hearing that before, but looking at the current fcntl() manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this is a newish feature, but it'd be nicer to use it if possible ? [man 2 fcntl] F_SETFL (long) Set the file status flags to the value specified by arg. File access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored. On Linux this command can only change the O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags. [/man] Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 13:12 ` Daniel P. Berrange @ 2011-08-05 14:28 ` Christoph Hellwig 2011-08-05 15:24 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Christoph Hellwig @ 2011-08-05 14:28 UTC (permalink / raw) To: Daniel P. Berrange Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, Stefan Hajnoczi, qemu-devel, Paolo Bonzini On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: > > Because you cannot change O_DIRECT on an open fd :(. This is why > > we're going through this pain. > > Hmm, I remember hearing that before, but looking at the current fcntl() > manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this > is a newish feature, but it'd be nicer to use it if possible ? It's been there since day 1 of O_DIRECT support. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 14:28 ` Christoph Hellwig @ 2011-08-05 15:24 ` Stefan Hajnoczi 2011-08-05 15:43 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-05 15:24 UTC (permalink / raw) To: Christoph Hellwig Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel, Paolo Bonzini On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig <hch@lst.de> wrote: > On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >> > Because you cannot change O_DIRECT on an open fd :(. This is why >> > we're going through this pain. >> >> Hmm, I remember hearing that before, but looking at the current fcntl() >> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this >> is a newish feature, but it'd be nicer to use it if possible ? > > It's been there since day 1 of O_DIRECT support. Sorry, my bad. So for Linux we could just use fcntl for block_set_hostcache and not bother with reopening. However, we will need to reopen should we wish to support changing O_DSYNC. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 15:24 ` Stefan Hajnoczi @ 2011-08-05 15:43 ` Kevin Wolf 2011-08-05 15:49 ` Anthony Liguori 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-05 15:43 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Supriya Kannery, Anthony Liguori, qemu-devel, Paolo Bonzini, Christoph Hellwig Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: > On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig <hch@lst.de> wrote: >> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>> we're going through this pain. >>> >>> Hmm, I remember hearing that before, but looking at the current fcntl() >>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this >>> is a newish feature, but it'd be nicer to use it if possible ? >> >> It's been there since day 1 of O_DIRECT support. > > Sorry, my bad. So for Linux we could just use fcntl for > block_set_hostcache and not bother with reopening. However, we will > need to reopen should we wish to support changing O_DSYNC. We do wish to support that. Anthony thinks that allowing the guest to toggle WCE is a prerequisite for making cache=writeback the default. And this is something that I definitely want to do for 1.0. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 15:43 ` Kevin Wolf @ 2011-08-05 15:49 ` Anthony Liguori 2011-08-08 7:02 ` Supriya Kannery 0 siblings, 1 reply; 39+ messages in thread From: Anthony Liguori @ 2011-08-05 15:49 UTC (permalink / raw) To: Kevin Wolf Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel, Supriya Kannery On 08/05/2011 10:43 AM, Kevin Wolf wrote: > Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>> we're going through this pain. >>>> >>>> Hmm, I remember hearing that before, but looking at the current fcntl() >>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this >>>> is a newish feature, but it'd be nicer to use it if possible ? >>> >>> It's been there since day 1 of O_DIRECT support. >> >> Sorry, my bad. So for Linux we could just use fcntl for >> block_set_hostcache and not bother with reopening. However, we will >> need to reopen should we wish to support changing O_DSYNC. > > We do wish to support that. > > Anthony thinks that allowing the guest to toggle WCE is a prerequisite > for making cache=writeback the default. And this is something that I > definitely want to do for 1.0. Indeed. Regards, Anthony Liguori > Kevin > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 15:49 ` Anthony Liguori @ 2011-08-08 7:02 ` Supriya Kannery 2011-08-08 8:12 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Supriya Kannery @ 2011-08-08 7:02 UTC (permalink / raw) To: Anthony Liguori Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Christoph Hellwig, Paolo Bonzini On 08/05/2011 09:19 PM, Anthony Liguori wrote: > On 08/05/2011 10:43 AM, Kevin Wolf wrote: >> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>> we're going through this pain. >>>>> >>>>> Hmm, I remember hearing that before, but looking at the current >>>>> fcntl() >>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>> this >>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>> >>>> It's been there since day 1 of O_DIRECT support. >>> >>> Sorry, my bad. So for Linux we could just use fcntl for >>> block_set_hostcache and not bother with reopening. However, we will >>> need to reopen should we wish to support changing O_DSYNC. >> >> We do wish to support that. >> >> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >> for making cache=writeback the default. And this is something that I >> definitely want to do for 1.0. > > Indeed. > We discussed the following so far... 1. How to safely reopen image files 2. Dynamic hostcache change 3. Support for dynamic change of O_DSYNC Since 2 is independent of 1, shall I go ahead implementing hostcache change using fcntl. Implementation for safely reopening image files using "BDRVReopenState" can be done separately as a pre-requisite before implementing 3 Thanks, Supriya > Regards, > > Anthony Liguori > >> Kevin >> > > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-08 7:02 ` Supriya Kannery @ 2011-08-08 8:12 ` Kevin Wolf 2011-08-09 9:22 ` supriya kannery 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-08 8:12 UTC (permalink / raw) To: supriyak; +Cc: qemu-devel, Stefan Hajnoczi, Christoph Hellwig, Paolo Bonzini Am 08.08.2011 09:02, schrieb Supriya Kannery: > On 08/05/2011 09:19 PM, Anthony Liguori wrote: >> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>> we're going through this pain. >>>>>> >>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>> fcntl() >>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>> this >>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>> >>>>> It's been there since day 1 of O_DIRECT support. >>>> >>>> Sorry, my bad. So for Linux we could just use fcntl for >>>> block_set_hostcache and not bother with reopening. However, we will >>>> need to reopen should we wish to support changing O_DSYNC. >>> >>> We do wish to support that. >>> >>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>> for making cache=writeback the default. And this is something that I >>> definitely want to do for 1.0. >> >> Indeed. >> > > We discussed the following so far... > 1. How to safely reopen image files > 2. Dynamic hostcache change > 3. Support for dynamic change of O_DSYNC > > Since 2 is independent of 1, shall I go ahead implementing > hostcache change using fcntl. > > Implementation for safely reopening image files using "BDRVReopenState" > can be done separately as a pre-requisite before implementing 3 Doing it separately means that we would introduce yet another callback that is used just to change O_DIRECT. In the end we want it to use bdrv_reopen(), too, so I'm not sure if there is a need for a temporary solution. Actually, once we know what we really want (I haven't seen many comments on the BDRVReopenState suggestion yet), it should be pretty easy to implement. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-08 8:12 ` Kevin Wolf @ 2011-08-09 9:22 ` supriya kannery 2011-08-09 9:51 ` Kevin Wolf 2011-10-10 18:28 ` [Qemu-devel] " Kevin Wolf 0 siblings, 2 replies; 39+ messages in thread From: supriya kannery @ 2011-08-09 9:22 UTC (permalink / raw) To: Kevin Wolf Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig Kevin Wolf wrote: > Am 08.08.2011 09:02, schrieb Supriya Kannery: > >> On 08/05/2011 09:19 PM, Anthony Liguori wrote: >> >>> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>> >>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>> >>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>> >>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>> >>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>>> we're going through this pain. >>>>>>>> >>>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>>> fcntl() >>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>>> this >>>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>>>> >>>>>> It's been there since day 1 of O_DIRECT support. >>>>>> >>>>> Sorry, my bad. So for Linux we could just use fcntl for >>>>> block_set_hostcache and not bother with reopening. However, we will >>>>> need to reopen should we wish to support changing O_DSYNC. >>>>> >>>> We do wish to support that. >>>> >>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>>> for making cache=writeback the default. And this is something that I >>>> definitely want to do for 1.0. >>>> >>> Indeed. >>> >>> >> We discussed the following so far... >> 1. How to safely reopen image files >> 2. Dynamic hostcache change >> 3. Support for dynamic change of O_DSYNC >> >> Since 2 is independent of 1, shall I go ahead implementing >> hostcache change using fcntl. >> >> Implementation for safely reopening image files using "BDRVReopenState" >> can be done separately as a pre-requisite before implementing 3 >> > > Doing it separately means that we would introduce yet another callback > that is used just to change O_DIRECT. In the end we want it to use > bdrv_reopen(), too, so I'm not sure if there is a need for a temporary > solution. > > Could you please explain "In the end we want it to use bdrv_reopen" at bit more. When fcntl() can change O_DIRECT on open fd , is there a need to "re-open" the image file? Considering the current way of having separate high level commands for changing block parameters (block_set_hostcache, and may be block_set_flush in furture), these dynamic requests will be sequential. So wouldn't it be better to avoid re-opening of image if possible for individual flag change request that comes in? > Actually, once we know what we really want (I haven't seen many comments > on the BDRVReopenState suggestion yet), it should be pretty easy to > implement. > > Kevin > Will work on to get an RFC patch with this proposed BDRVReopenState to get more inputs. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 9:22 ` supriya kannery @ 2011-08-09 9:51 ` Kevin Wolf 2011-08-09 9:32 ` supriya kannery 2011-10-10 18:28 ` [Qemu-devel] " Kevin Wolf 1 sibling, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-09 9:51 UTC (permalink / raw) To: supriya kannery Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig Am 09.08.2011 11:22, schrieb supriya kannery: > Kevin Wolf wrote: >> Am 08.08.2011 09:02, schrieb Supriya Kannery: >> >>> On 08/05/2011 09:19 PM, Anthony Liguori wrote: >>> >>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>>> >>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>>> >>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>>> >>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>>> >>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>>>> we're going through this pain. >>>>>>>>> >>>>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>>>> fcntl() >>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>>>> this >>>>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>>>>> >>>>>>> It's been there since day 1 of O_DIRECT support. >>>>>>> >>>>>> Sorry, my bad. So for Linux we could just use fcntl for >>>>>> block_set_hostcache and not bother with reopening. However, we will >>>>>> need to reopen should we wish to support changing O_DSYNC. >>>>>> >>>>> We do wish to support that. >>>>> >>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>>>> for making cache=writeback the default. And this is something that I >>>>> definitely want to do for 1.0. >>>>> >>>> Indeed. >>>> >>>> >>> We discussed the following so far... >>> 1. How to safely reopen image files >>> 2. Dynamic hostcache change >>> 3. Support for dynamic change of O_DSYNC >>> >>> Since 2 is independent of 1, shall I go ahead implementing >>> hostcache change using fcntl. >>> >>> Implementation for safely reopening image files using "BDRVReopenState" >>> can be done separately as a pre-requisite before implementing 3 >>> >> >> Doing it separately means that we would introduce yet another callback >> that is used just to change O_DIRECT. In the end we want it to use >> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary >> solution. >> >> > Could you please explain "In the end we want it to use bdrv_reopen" at > bit more. > When fcntl() can change O_DIRECT on open fd , is there a need to > "re-open" > the image file? What I meant is that in the end, with a generic bdrv_reopen(), we can have raw-posix only call dup() and fcntl() instead of doing a close()/open() sequence if it can satisfy the new flags this way. But this would be an implementation detail and not be visible in the interface. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 9:51 ` Kevin Wolf @ 2011-08-09 9:32 ` supriya kannery 2011-08-16 19:18 ` [Qemu-devel] [RFC] " Supriya Kannery 2011-08-16 19:18 ` Supriya Kannery 0 siblings, 2 replies; 39+ messages in thread From: supriya kannery @ 2011-08-09 9:32 UTC (permalink / raw) To: Kevin Wolf Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig Kevin Wolf wrote: > Am 09.08.2011 11:22, schrieb supriya kannery: > >> Kevin Wolf wrote: >> >>> Am 08.08.2011 09:02, schrieb Supriya Kannery: >>> >>> >>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote: >>>> >>>> >>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>>>> >>>>> >>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>>>> >>>>>> >>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>>>> >>>>>>> >>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>>>> >>>>>>>> >>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>>>>> we're going through this pain. >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>>>>> fcntl() >>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>>>>> this >>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>>>>>> >>>>>>>>> >>>>>>>> It's been there since day 1 of O_DIRECT support. >>>>>>>> >>>>>>>> >>>>>>> Sorry, my bad. So for Linux we could just use fcntl for >>>>>>> block_set_hostcache and not bother with reopening. However, we will >>>>>>> need to reopen should we wish to support changing O_DSYNC. >>>>>>> >>>>>>> >>>>>> We do wish to support that. >>>>>> >>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>>>>> for making cache=writeback the default. And this is something that I >>>>>> definitely want to do for 1.0. >>>>>> >>>>>> >>>>> Indeed. >>>>> >>>>> >>>>> >>>> We discussed the following so far... >>>> 1. How to safely reopen image files >>>> 2. Dynamic hostcache change >>>> 3. Support for dynamic change of O_DSYNC >>>> >>>> Since 2 is independent of 1, shall I go ahead implementing >>>> hostcache change using fcntl. >>>> >>>> Implementation for safely reopening image files using "BDRVReopenState" >>>> can be done separately as a pre-requisite before implementing 3 >>>> >>>> >>> Doing it separately means that we would introduce yet another callback >>> that is used just to change O_DIRECT. In the end we want it to use >>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary >>> solution. >>> >>> >>> >> Could you please explain "In the end we want it to use bdrv_reopen" at >> bit more. >> When fcntl() can change O_DIRECT on open fd , is there a need to >> "re-open" >> the image file? >> > > What I meant is that in the end, with a generic bdrv_reopen(), we can > have raw-posix only call dup() and fcntl() instead of doing a > close()/open() sequence if it can satisfy the new flags this way. But > this would be an implementation detail and not be visible in the interface. > > Kevin > ok - thanks, Supriya ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds 2011-08-09 9:32 ` supriya kannery @ 2011-08-16 19:18 ` Supriya Kannery 2011-08-16 19:18 ` Supriya Kannery 1 sibling, 0 replies; 39+ messages in thread From: Supriya Kannery @ 2011-08-16 19:18 UTC (permalink / raw) To: supriya kannery Cc: Kevin Wolf, Stefan Hajnoczi, Christoph Hellwig, qemu-devel, Paolo Bonzini On 08/09/2011 03:02 PM, supriya kannery wrote: > Kevin Wolf wrote: >> Am 09.08.2011 11:22, schrieb supriya kannery: >>> Kevin Wolf wrote: >> >> What I meant is that in the end, with a generic bdrv_reopen(), we can >> have raw-posix only call dup() and fcntl() instead of doing a >> close()/open() sequence if it can satisfy the new flags this way. But >> this would be an implementation detail and not be visible in the >> interface. >> >> Kevin > > ok > - thanks, Supriya > Though I started the RFC patch with defining BDRVReopenState, ended up in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen mplementation specific to respective driver is assigned to this function pointer. Please find the implementation of O_DIRECT flag change, done in raw-posix.c. Similar implementation can be done for vmdk (with bdrv_reopen_commit and bdrv_reopen_abort as internal functions in vmdk.c for opening split files), sheepdog, nbd etc.. Request for comments on this approach. Note: Following patch is for demonstrating the approach using raw-posix as sample driver. This patch is not complete. --- block.c | 33 +++++++++++++++++++++++---------- block/raw-posix.c | 34 ++++++++++++++++++++++++++++++++++ block_int.h | 1 + 3 files changed, 58 insertions(+), 10 deletions(-) Index: qemu/block/raw-posix.c =================================================================== --- qemu.orig/block/raw-posix.c +++ qemu/block/raw-posix.c @@ -276,6 +276,39 @@ static int raw_open(BlockDriverState *bs return raw_open_common(bs, filename, flags, 0); } +static int raw_reopen(BlockDriverState *bs, int flags) +{ + BDRVRawState *s = bs->opaque; + int new_fd; + int ret = 0; + + /* Handle request for toggling O_DIRECT */ + if ((bs->open_flags | BDRV_O_NOCACHE) ^ + (flags | BDRV_O_NOCACHE)) + { + if ((new_fd = dup(s->fd)) <= 0) { + return -1; + } + + if ((ret = fcntl_setfl(new_fd, flags)) < 0) { + close(new_fd); + return ret; + } + + /* Set new flags, so replace old fd with new one */ + close(s->fd); + s->fd = new_fd; + s->open_flags = flags; + bs->open_flags = flags; + } + + /* + * TBD: handle O_DSYNC and other flags for which image + * file has to be reopened + */ + return 0; +} + /* XXX: use host sector size if necessary with: #ifdef DIOCGSECTORSIZE { @@ -886,6 +919,7 @@ static BlockDriver bdrv_file = { .instance_size = sizeof(BDRVRawState), .bdrv_probe = NULL, /* no probe for protocols */ .bdrv_file_open = raw_open, + .bdrv_reopen = raw_reopen, .bdrv_read = raw_read, .bdrv_write = raw_write, .bdrv_close = raw_close, Index: qemu/block.c =================================================================== --- qemu.orig/block.c +++ qemu/block.c @@ -687,20 +687,33 @@ int bdrv_reopen(BlockDriverState *bs, in qerror_report(QERR_DATA_SYNC_FAILED, bs->device_name); return ret; } - open_flags = bs->open_flags; - bdrv_close(bs); - ret = bdrv_open(bs, bs->filename, bdrv_flags, drv); - if (ret < 0) { - /* Reopen failed. Try to open with original flags */ - qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); - ret = bdrv_open(bs, bs->filename, open_flags, drv); + /* Use driver specific reopen() if available */ + if (drv->bdrv_reopen) { + if ((ret = drv->bdrv_reopen(bs, bdrv_flags)) < 0) { + qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); + return ret; + } + } else { + /* Use reopen procedure common for drivers */ + open_flags = bs->open_flags; + bdrv_close(bs); + + ret = bdrv_open(bs, bs->filename, bdrv_flags, drv); if (ret < 0) { - /* Reopen failed with orig and modified flags */ - abort(); + /* + * Reopen failed. Try to open with original flags + */ + qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); + ret = bdrv_open(bs, bs->filename, open_flags, drv); + if (ret < 0) { + /* + * Reopen with orig and modified flags failed + */ + abort(); + } } } - return ret; } Index: qemu/block_int.h =================================================================== --- qemu.orig/block_int.h +++ qemu/block_int.h @@ -54,6 +54,7 @@ struct BlockDriver { int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename); int (*bdrv_probe_device)(const char *filename); int (*bdrv_open)(BlockDriverState *bs, int flags); + int (*bdrv_reopen)(BlockDriverState *bs, int flags); int (*bdrv_file_open)(BlockDriverState *bs, const char *filename, int flags); int (*bdrv_read)(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors); ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds 2011-08-09 9:32 ` supriya kannery 2011-08-16 19:18 ` [Qemu-devel] [RFC] " Supriya Kannery @ 2011-08-16 19:18 ` Supriya Kannery 2011-08-17 14:35 ` Kevin Wolf 1 sibling, 1 reply; 39+ messages in thread From: Supriya Kannery @ 2011-08-16 19:18 UTC (permalink / raw) To: supriya kannery Cc: Kevin Wolf, Stefan Hajnoczi, Christoph Hellwig, qemu-devel, Paolo Bonzini On 08/09/2011 03:02 PM, supriya kannery wrote: > Kevin Wolf wrote: >> Am 09.08.2011 11:22, schrieb supriya kannery: >>> Kevin Wolf wrote: >> >> What I meant is that in the end, with a generic bdrv_reopen(), we can >> have raw-posix only call dup() and fcntl() instead of doing a >> close()/open() sequence if it can satisfy the new flags this way. But >> this would be an implementation detail and not be visible in the >> interface. >> >> Kevin > > ok > - thanks, Supriya > Though I started the RFC patch with defining BDRVReopenState, ended up in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen mplementation specific to respective driver is assigned to this function pointer. Please find the implementation of O_DIRECT flag change, done in raw-posix.c. Similar implementation can be done for vmdk (with bdrv_reopen_commit and bdrv_reopen_abort as internal functions in vmdk.c for opening split files), sheepdog, nbd etc.. Request for comments on this approach. Note: Following patch is for demonstrating the approach using raw-posix as sample driver. This patch is not complete. - thanks, Supriya --- block.c | 33 +++++++++++++++++++++++---------- block/raw-posix.c | 34 ++++++++++++++++++++++++++++++++++ block_int.h | 1 + 3 files changed, 58 insertions(+), 10 deletions(-) Index: qemu/block/raw-posix.c =================================================================== --- qemu.orig/block/raw-posix.c +++ qemu/block/raw-posix.c @@ -276,6 +276,39 @@ static int raw_open(BlockDriverState *bs return raw_open_common(bs, filename, flags, 0); } +static int raw_reopen(BlockDriverState *bs, int flags) +{ + BDRVRawState *s = bs->opaque; + int new_fd; + int ret = 0; + + /* Handle request for toggling O_DIRECT */ + if ((bs->open_flags | BDRV_O_NOCACHE) ^ + (flags | BDRV_O_NOCACHE)) + { + if ((new_fd = dup(s->fd)) <= 0) { + return -1; + } + + if ((ret = fcntl_setfl(new_fd, flags)) < 0) { + close(new_fd); + return ret; + } + + /* Set new flags, so replace old fd with new one */ + close(s->fd); + s->fd = new_fd; + s->open_flags = flags; + bs->open_flags = flags; + } + + /* + * TBD: handle O_DSYNC and other flags for which image + * file has to be reopened + */ + return 0; +} + /* XXX: use host sector size if necessary with: #ifdef DIOCGSECTORSIZE { @@ -886,6 +919,7 @@ static BlockDriver bdrv_file = { .instance_size = sizeof(BDRVRawState), .bdrv_probe = NULL, /* no probe for protocols */ .bdrv_file_open = raw_open, + .bdrv_reopen = raw_reopen, .bdrv_read = raw_read, .bdrv_write = raw_write, .bdrv_close = raw_close, Index: qemu/block.c =================================================================== --- qemu.orig/block.c +++ qemu/block.c @@ -687,20 +687,33 @@ int bdrv_reopen(BlockDriverState *bs, in qerror_report(QERR_DATA_SYNC_FAILED, bs->device_name); return ret; } - open_flags = bs->open_flags; - bdrv_close(bs); - ret = bdrv_open(bs, bs->filename, bdrv_flags, drv); - if (ret < 0) { - /* Reopen failed. Try to open with original flags */ - qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); - ret = bdrv_open(bs, bs->filename, open_flags, drv); + /* Use driver specific reopen() if available */ + if (drv->bdrv_reopen) { + if ((ret = drv->bdrv_reopen(bs, bdrv_flags)) < 0) { + qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); + return ret; + } + } else { + /* Use reopen procedure common for drivers */ + open_flags = bs->open_flags; + bdrv_close(bs); + + ret = bdrv_open(bs, bs->filename, bdrv_flags, drv); if (ret < 0) { - /* Reopen failed with orig and modified flags */ - abort(); + /* + * Reopen failed. Try to open with original flags + */ + qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename); + ret = bdrv_open(bs, bs->filename, open_flags, drv); + if (ret < 0) { + /* + * Reopen with orig and modified flags failed + */ + abort(); + } } } - return ret; } Index: qemu/block_int.h =================================================================== --- qemu.orig/block_int.h +++ qemu/block_int.h @@ -54,6 +54,7 @@ struct BlockDriver { int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename); int (*bdrv_probe_device)(const char *filename); int (*bdrv_open)(BlockDriverState *bs, int flags); + int (*bdrv_reopen)(BlockDriverState *bs, int flags); int (*bdrv_file_open)(BlockDriverState *bs, const char *filename, int flags); int (*bdrv_read)(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors); ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds 2011-08-16 19:18 ` Supriya Kannery @ 2011-08-17 14:35 ` Kevin Wolf 0 siblings, 0 replies; 39+ messages in thread From: Kevin Wolf @ 2011-08-17 14:35 UTC (permalink / raw) To: supriyak Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel, supriya kannery Am 16.08.2011 21:18, schrieb Supriya Kannery: > On 08/09/2011 03:02 PM, supriya kannery wrote: > > Kevin Wolf wrote: > >> Am 09.08.2011 11:22, schrieb supriya kannery: > >>> Kevin Wolf wrote: > >> > >> What I meant is that in the end, with a generic bdrv_reopen(), we can > >> have raw-posix only call dup() and fcntl() instead of doing a > >> close()/open() sequence if it can satisfy the new flags this way. But > >> this would be an implementation detail and not be visible in the > >> interface. > >> > >> Kevin > > > > ok > > - thanks, Supriya > > > > Though I started the RFC patch with defining BDRVReopenState, ended up > in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen > mplementation specific to respective driver is assigned to this > function pointer. > > Please find the implementation of O_DIRECT flag change, done in > raw-posix.c. Similar implementation can be done for vmdk (with > bdrv_reopen_commit and bdrv_reopen_abort as internal functions in > vmdk.c for opening split files), sheepdog, nbd etc.. I haven't looked at the code yet, but this can't work. I have thought of BDRVReopenState for a reason. If you have multiple files like VMDK can have, then you want an all-or-nothing semantics of reopen. With only .bdrv_reopen, I can't see a way to guarantee this. If reopening the first file succeeds and the second one goes wrong, you have the new flags on the first file, but the old flags on the second one. You would have to re-reopen the first file with the old flags, but then this is not guaranteed to succeed. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 9:22 ` supriya kannery 2011-08-09 9:51 ` Kevin Wolf @ 2011-10-10 18:28 ` Kevin Wolf 2011-10-11 5:21 ` Supriya Kannery 1 sibling, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-10-10 18:28 UTC (permalink / raw) To: supriya kannery Cc: supriyak, Paolo Bonzini, Christoph Hellwig, qemu-devel, Stefan Hajnoczi Am 09.08.2011 11:22, schrieb supriya kannery: > Kevin Wolf wrote: >> Am 08.08.2011 09:02, schrieb Supriya Kannery: >> >>> On 08/05/2011 09:19 PM, Anthony Liguori wrote: >>> >>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>>> >>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>>> >>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>>> >>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>>> >>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>>>> we're going through this pain. >>>>>>>>> >>>>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>>>> fcntl() >>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>>>> this >>>>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>>>>> >>>>>>> It's been there since day 1 of O_DIRECT support. >>>>>>> >>>>>> Sorry, my bad. So for Linux we could just use fcntl for >>>>>> block_set_hostcache and not bother with reopening. However, we will >>>>>> need to reopen should we wish to support changing O_DSYNC. >>>>>> >>>>> We do wish to support that. >>>>> >>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>>>> for making cache=writeback the default. And this is something that I >>>>> definitely want to do for 1.0. >>>>> >>>> Indeed. >>>> >>>> >>> We discussed the following so far... >>> 1. How to safely reopen image files >>> 2. Dynamic hostcache change >>> 3. Support for dynamic change of O_DSYNC >>> >>> Since 2 is independent of 1, shall I go ahead implementing >>> hostcache change using fcntl. >>> >>> Implementation for safely reopening image files using "BDRVReopenState" >>> can be done separately as a pre-requisite before implementing 3 >>> >> >> Doing it separately means that we would introduce yet another callback >> that is used just to change O_DIRECT. In the end we want it to use >> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary >> solution. >> >> > Could you please explain "In the end we want it to use bdrv_reopen" at > bit more. > When fcntl() can change O_DIRECT on open fd , is there a need to > "re-open" > the image file? > > Considering the current way of having separate high level commands for > changing block parameters (block_set_hostcache, and may be block_set_flush > in furture), these dynamic requests will be sequential. So wouldn't it > be better to > avoid re-opening of image if possible for individual flag change request > that comes in? >> Actually, once we know what we really want (I haven't seen many comments >> on the BDRVReopenState suggestion yet), it should be pretty easy to >> implement. >> >> Kevin >> > Will work on to get an RFC patch with this proposed BDRVReopenState to > get more > inputs. Are you still going to prepare an RFC patch implementing bdrv_reopen_prepare/commit/abort using a BDRVReopenState? Or has it even been posted and I just missed it? Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-10-10 18:28 ` [Qemu-devel] " Kevin Wolf @ 2011-10-11 5:21 ` Supriya Kannery 0 siblings, 0 replies; 39+ messages in thread From: Supriya Kannery @ 2011-10-11 5:21 UTC (permalink / raw) To: Kevin Wolf Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel, supriya kannery On 10/10/2011 11:58 PM, Kevin Wolf wrote: > Am 09.08.2011 11:22, schrieb supriya kannery: >> Kevin Wolf wrote: >>> Am 08.08.2011 09:02, schrieb Supriya Kannery: >>> >>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote: >>>> >>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote: >>>>> >>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi: >>>>>> >>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote: >>>>>>> >>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote: >>>>>>>> >>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why >>>>>>>>>> we're going through this pain. >>>>>>>>>> >>>>>>>>> Hmm, I remember hearing that before, but looking at the current >>>>>>>>> fcntl() >>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps >>>>>>>>> this >>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ? >>>>>>>>> >>>>>>>> It's been there since day 1 of O_DIRECT support. >>>>>>>> >>>>>>> Sorry, my bad. So for Linux we could just use fcntl for >>>>>>> block_set_hostcache and not bother with reopening. However, we will >>>>>>> need to reopen should we wish to support changing O_DSYNC. >>>>>>> >>>>>> We do wish to support that. >>>>>> >>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite >>>>>> for making cache=writeback the default. And this is something that I >>>>>> definitely want to do for 1.0. >>>>>> >>>>> Indeed. >>>>> >>>>> >>>> We discussed the following so far... >>>> 1. How to safely reopen image files >>>> 2. Dynamic hostcache change >>>> 3. Support for dynamic change of O_DSYNC >>>> >>>> Since 2 is independent of 1, shall I go ahead implementing >>>> hostcache change using fcntl. >>>> >>>> Implementation for safely reopening image files using "BDRVReopenState" >>>> can be done separately as a pre-requisite before implementing 3 >>>> >>> >>> Doing it separately means that we would introduce yet another callback >>> that is used just to change O_DIRECT. In the end we want it to use >>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary >>> solution. >>> >>> >> Could you please explain "In the end we want it to use bdrv_reopen" at >> bit more. >> When fcntl() can change O_DIRECT on open fd , is there a need to >> "re-open" >> the image file? >> >> Considering the current way of having separate high level commands for >> changing block parameters (block_set_hostcache, and may be block_set_flush >> in furture), these dynamic requests will be sequential. So wouldn't it >> be better to >> avoid re-opening of image if possible for individual flag change request >> that comes in? >>> Actually, once we know what we really want (I haven't seen many comments >>> on the BDRVReopenState suggestion yet), it should be pretty easy to >>> implement. >>> >>> Kevin >>> >> Will work on to get an RFC patch with this proposed BDRVReopenState to >> get more >> inputs. > > Are you still going to prepare an RFC patch implementing > bdrv_reopen_prepare/commit/abort using a BDRVReopenState? Or has it even > been posted and I just missed it? > > Kevin Pls review the patchset posted at http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg01014.html Working on further to similarly use BDRVReopenState for raw-win32 images. thanks, Supriya ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:27 ` Stefan Hajnoczi 2011-08-05 9:55 ` Paolo Bonzini 2011-08-05 13:12 ` Daniel P. Berrange @ 2011-08-05 14:27 ` Christoph Hellwig 2 siblings, 0 replies; 39+ messages in thread From: Christoph Hellwig @ 2011-08-05 14:27 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Kevin Wolf, Paolo Bonzini, Anthony Liguori, qemu-devel, Supriya Kannery On Fri, Aug 05, 2011 at 10:27:00AM +0100, Stefan Hajnoczi wrote: > > Why not do the latter unconditionally? > > Because you cannot change O_DIRECT on an open fd :(. This is why > we're going through this pain. You can. What you can't right now is O_SYNC, but we're going to change that. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi 2011-08-05 9:04 ` Paolo Bonzini @ 2011-08-05 9:07 ` Kevin Wolf 2011-08-05 9:29 ` Stefan Hajnoczi 2011-08-05 20:16 ` Blue Swirl 2 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-05 9:07 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: > We've discussed safe methods for reopening image files (e.g. useful for > changing the hostcache parameter). The problem is that closing the file first > and then opening it again exposes us to the error case where the open fails. > At that point we cannot get to the file anymore and our options are to > terminate QEMU, pause the VM, or offline the block device. > > This window of vulnerability can be eliminated by keeping the file descriptor > around and falling back to it should the open fail. > > The challenge for the file descriptor approach is that image formats, like > VMDK, can span multiple files. Therefore the solution is not as simple as > stashing a single file descriptor and reopening from it. So far I agree. The rest I believe is wrong because you can't assume that every backend uses file descriptors. The qemu block layer is based on BlockDriverStates, not fds. They are a concept that should be hidden in raw-posix. I think something like this could do: struct BDRVReopenState { BlockDriverState *bs; /* can be extended by block drivers */ }; .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int flags); .bdrv_reopen_commit(BDRVReopenState *reopen_state); .bdrv_reopen_abort(BDRVReopenState *reopen_state); raw-posix would store the old file descriptor in its reopen_state. On commit, it closes the old descriptors, on abort it reverts to the old one and closes the newly opened one. Makes things a bit more complicated than the simple bdrv_reopen I had in mind before, but it allows VMDK to get an all-or-nothing semantics. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:07 ` Kevin Wolf @ 2011-08-05 9:29 ` Stefan Hajnoczi 2011-08-05 9:48 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-05 9:29 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >> We've discussed safe methods for reopening image files (e.g. useful for >> changing the hostcache parameter). The problem is that closing the file first >> and then opening it again exposes us to the error case where the open fails. >> At that point we cannot get to the file anymore and our options are to >> terminate QEMU, pause the VM, or offline the block device. >> >> This window of vulnerability can be eliminated by keeping the file descriptor >> around and falling back to it should the open fail. >> >> The challenge for the file descriptor approach is that image formats, like >> VMDK, can span multiple files. Therefore the solution is not as simple as >> stashing a single file descriptor and reopening from it. > > So far I agree. The rest I believe is wrong because you can't assume > that every backend uses file descriptors. The qemu block layer is based > on BlockDriverStates, not fds. They are a concept that should be hidden > in raw-posix. > > I think something like this could do: > > struct BDRVReopenState { > BlockDriverState *bs; > /* can be extended by block drivers */ > }; > > .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int > flags); > .bdrv_reopen_commit(BDRVReopenState *reopen_state); > .bdrv_reopen_abort(BDRVReopenState *reopen_state); > > raw-posix would store the old file descriptor in its reopen_state. On > commit, it closes the old descriptors, on abort it reverts to the old > one and closes the newly opened one. > > Makes things a bit more complicated than the simple bdrv_reopen I had in > mind before, but it allows VMDK to get an all-or-nothing semantics. Can you show how bdrv_reopen() would use these new interfaces? I'm not 100% clear on the idea. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:29 ` Stefan Hajnoczi @ 2011-08-05 9:48 ` Kevin Wolf 2011-08-08 14:49 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-05 9:48 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: > On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>> We've discussed safe methods for reopening image files (e.g. useful for >>> changing the hostcache parameter). The problem is that closing the file first >>> and then opening it again exposes us to the error case where the open fails. >>> At that point we cannot get to the file anymore and our options are to >>> terminate QEMU, pause the VM, or offline the block device. >>> >>> This window of vulnerability can be eliminated by keeping the file descriptor >>> around and falling back to it should the open fail. >>> >>> The challenge for the file descriptor approach is that image formats, like >>> VMDK, can span multiple files. Therefore the solution is not as simple as >>> stashing a single file descriptor and reopening from it. >> >> So far I agree. The rest I believe is wrong because you can't assume >> that every backend uses file descriptors. The qemu block layer is based >> on BlockDriverStates, not fds. They are a concept that should be hidden >> in raw-posix. >> >> I think something like this could do: >> >> struct BDRVReopenState { >> BlockDriverState *bs; >> /* can be extended by block drivers */ >> }; >> >> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >> flags); >> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >> >> raw-posix would store the old file descriptor in its reopen_state. On >> commit, it closes the old descriptors, on abort it reverts to the old >> one and closes the newly opened one. >> >> Makes things a bit more complicated than the simple bdrv_reopen I had in >> mind before, but it allows VMDK to get an all-or-nothing semantics. > > Can you show how bdrv_reopen() would use these new interfaces? I'm > not 100% clear on the idea. Well, you wouldn't only call bdrv_reopen, but also either bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper function that does both, but that's syntactic sugar). For example we would have: int vmdk_reopen() { *((VMDKReopenState**) rs) = malloc(); foreach (extent in s->extents) { ret = bdrv_reopen(extent->file, &extent->reopen_state) if (ret < 0) goto fail; } return 0; fail: foreach (extent in rs->already_reopened) { bdrv_reopen_abort(extent->reopen_state); } return ret; } void vmdk_reopen_commit() { foreach (extent in s->extents) { bdrv_reopen_commit(extent->reopen_state); } free(rs); } void vmdk_reopen_abort() { foreach (extent in s->extents) { bdrv_reopen_abort(extent->reopen_state); } free(rs); } The top-level caller, which isn't a block driver, but just wants to have an image reopened, will do something like this (as I said, this should probably be a wrapper function in block.c): BDRVReopenState *rs; ret = bdrv_reopen(bs, &rs); if (ret < 0) { goto fail; } bdrv_reopen_commit(rs); Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 9:48 ` Kevin Wolf @ 2011-08-08 14:49 ` Stefan Hajnoczi 2011-08-08 15:16 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-08 14:49 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>> We've discussed safe methods for reopening image files (e.g. useful for >>>> changing the hostcache parameter). The problem is that closing the file first >>>> and then opening it again exposes us to the error case where the open fails. >>>> At that point we cannot get to the file anymore and our options are to >>>> terminate QEMU, pause the VM, or offline the block device. >>>> >>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>> around and falling back to it should the open fail. >>>> >>>> The challenge for the file descriptor approach is that image formats, like >>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>> stashing a single file descriptor and reopening from it. >>> >>> So far I agree. The rest I believe is wrong because you can't assume >>> that every backend uses file descriptors. The qemu block layer is based >>> on BlockDriverStates, not fds. They are a concept that should be hidden >>> in raw-posix. >>> >>> I think something like this could do: >>> >>> struct BDRVReopenState { >>> BlockDriverState *bs; >>> /* can be extended by block drivers */ >>> }; >>> >>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>> flags); >>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>> >>> raw-posix would store the old file descriptor in its reopen_state. On >>> commit, it closes the old descriptors, on abort it reverts to the old >>> one and closes the newly opened one. >>> >>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>> mind before, but it allows VMDK to get an all-or-nothing semantics. >> >> Can you show how bdrv_reopen() would use these new interfaces? I'm >> not 100% clear on the idea. > > Well, you wouldn't only call bdrv_reopen, but also either > bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper > function that does both, but that's syntactic sugar). > > For example we would have: > > int vmdk_reopen() .bdrv_reopen() is a confusing name for this operation because it does not reopen anything. bdrv_prepare_reopen() might be clearer. > { > *((VMDKReopenState**) rs) = malloc(); > > foreach (extent in s->extents) { > ret = bdrv_reopen(extent->file, &extent->reopen_state) > if (ret < 0) > goto fail; > } > return 0; > > fail: > foreach (extent in rs->already_reopened) { > bdrv_reopen_abort(extent->reopen_state); > } > return ret; > } > void vmdk_reopen_commit() > { > foreach (extent in s->extents) { > bdrv_reopen_commit(extent->reopen_state); > } > free(rs); > } > > void vmdk_reopen_abort() > { > foreach (extent in s->extents) { > bdrv_reopen_abort(extent->reopen_state); > } > free(rs); > } Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, &rs)? There is more state than just the file descriptors and I'm not sure that that gets preserved unless we add code to stash away stuff. I'm basically hoping this interface does not require touching every BlockDriver. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-08 14:49 ` Stefan Hajnoczi @ 2011-08-08 15:16 ` Kevin Wolf 2011-08-09 10:25 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-08 15:16 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: > On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>> changing the hostcache parameter). The problem is that closing the file first >>>>> and then opening it again exposes us to the error case where the open fails. >>>>> At that point we cannot get to the file anymore and our options are to >>>>> terminate QEMU, pause the VM, or offline the block device. >>>>> >>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>> around and falling back to it should the open fail. >>>>> >>>>> The challenge for the file descriptor approach is that image formats, like >>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>> stashing a single file descriptor and reopening from it. >>>> >>>> So far I agree. The rest I believe is wrong because you can't assume >>>> that every backend uses file descriptors. The qemu block layer is based >>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>> in raw-posix. >>>> >>>> I think something like this could do: >>>> >>>> struct BDRVReopenState { >>>> BlockDriverState *bs; >>>> /* can be extended by block drivers */ >>>> }; >>>> >>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>> flags); >>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>> >>>> raw-posix would store the old file descriptor in its reopen_state. On >>>> commit, it closes the old descriptors, on abort it reverts to the old >>>> one and closes the newly opened one. >>>> >>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>> >>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>> not 100% clear on the idea. >> >> Well, you wouldn't only call bdrv_reopen, but also either >> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >> function that does both, but that's syntactic sugar). >> >> For example we would have: >> >> int vmdk_reopen() > > .bdrv_reopen() is a confusing name for this operation because it does > not reopen anything. bdrv_prepare_reopen() might be clearer. Makes sense. > >> { >> *((VMDKReopenState**) rs) = malloc(); >> >> foreach (extent in s->extents) { >> ret = bdrv_reopen(extent->file, &extent->reopen_state) >> if (ret < 0) >> goto fail; >> } >> return 0; >> >> fail: >> foreach (extent in rs->already_reopened) { >> bdrv_reopen_abort(extent->reopen_state); >> } >> return ret; >> } > >> void vmdk_reopen_commit() >> { >> foreach (extent in s->extents) { >> bdrv_reopen_commit(extent->reopen_state); >> } >> free(rs); >> } >> >> void vmdk_reopen_abort() >> { >> foreach (extent in s->extents) { >> bdrv_reopen_abort(extent->reopen_state); >> } >> free(rs); >> } > > Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, > &rs)? No. Closing the old backend would be part of bdrv_reopen_commit. Do you have a use case where it would be helpful if the caller invoked bdrv_close? > There is more state than just the file descriptors and I'm not > sure that that gets preserved unless we add code to stash away stuff. > I'm basically hoping this interface does not require touching every > BlockDriver. If we only want to change flags like O_DIRECT or O_SYNC, I think format drivers (except VMDK) can use a standard implementation that just reopens bs->file. If we wanted bdrv_reopen to ensure that all caches are dropped etc. then I think we need a specific implementation in all drivers unless bdrv->bdrv_open/bdrv_close is good enough to emulate it. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-08 15:16 ` Kevin Wolf @ 2011-08-09 10:25 ` Stefan Hajnoczi 2011-08-09 10:35 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-09 10:25 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>> At that point we cannot get to the file anymore and our options are to >>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>> >>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>> around and falling back to it should the open fail. >>>>>> >>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>> stashing a single file descriptor and reopening from it. >>>>> >>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>> that every backend uses file descriptors. The qemu block layer is based >>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>> in raw-posix. >>>>> >>>>> I think something like this could do: >>>>> >>>>> struct BDRVReopenState { >>>>> BlockDriverState *bs; >>>>> /* can be extended by block drivers */ >>>>> }; >>>>> >>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>> flags); >>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>> >>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>> one and closes the newly opened one. >>>>> >>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>> >>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>> not 100% clear on the idea. >>> >>> Well, you wouldn't only call bdrv_reopen, but also either >>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>> function that does both, but that's syntactic sugar). >>> >>> For example we would have: >>> >>> int vmdk_reopen() >> >> .bdrv_reopen() is a confusing name for this operation because it does >> not reopen anything. bdrv_prepare_reopen() might be clearer. > > Makes sense. > >> >>> { >>> *((VMDKReopenState**) rs) = malloc(); >>> >>> foreach (extent in s->extents) { >>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>> if (ret < 0) >>> goto fail; >>> } >>> return 0; >>> >>> fail: >>> foreach (extent in rs->already_reopened) { >>> bdrv_reopen_abort(extent->reopen_state); >>> } >>> return ret; >>> } >> >>> void vmdk_reopen_commit() >>> { >>> foreach (extent in s->extents) { >>> bdrv_reopen_commit(extent->reopen_state); >>> } >>> free(rs); >>> } >>> >>> void vmdk_reopen_abort() >>> { >>> foreach (extent in s->extents) { >>> bdrv_reopen_abort(extent->reopen_state); >>> } >>> free(rs); >>> } >> >> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >> &rs)? > > No. Closing the old backend would be part of bdrv_reopen_commit. > > Do you have a use case where it would be helpful if the caller invoked > bdrv_close? When the caller does bdrv_close() two BlockDriverStates are never open for the same image file. I thought this was a property we wanted. Also, in the block_set_hostcache case we need to reopen without switching to a new BlockDriverState instance. That means the reopen needs to be in-place with respect to the BlockDriverState *bs pointer. We cannot create a new instance. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 10:25 ` Stefan Hajnoczi @ 2011-08-09 10:35 ` Kevin Wolf 2011-08-09 10:50 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-09 10:35 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: > On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>> >>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>> around and falling back to it should the open fail. >>>>>>> >>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>> stashing a single file descriptor and reopening from it. >>>>>> >>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>> in raw-posix. >>>>>> >>>>>> I think something like this could do: >>>>>> >>>>>> struct BDRVReopenState { >>>>>> BlockDriverState *bs; >>>>>> /* can be extended by block drivers */ >>>>>> }; >>>>>> >>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>> flags); >>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>> >>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>> one and closes the newly opened one. >>>>>> >>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>> >>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>> not 100% clear on the idea. >>>> >>>> Well, you wouldn't only call bdrv_reopen, but also either >>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>> function that does both, but that's syntactic sugar). >>>> >>>> For example we would have: >>>> >>>> int vmdk_reopen() >>> >>> .bdrv_reopen() is a confusing name for this operation because it does >>> not reopen anything. bdrv_prepare_reopen() might be clearer. >> >> Makes sense. >> >>> >>>> { >>>> *((VMDKReopenState**) rs) = malloc(); >>>> >>>> foreach (extent in s->extents) { >>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>> if (ret < 0) >>>> goto fail; >>>> } >>>> return 0; >>>> >>>> fail: >>>> foreach (extent in rs->already_reopened) { >>>> bdrv_reopen_abort(extent->reopen_state); >>>> } >>>> return ret; >>>> } >>> >>>> void vmdk_reopen_commit() >>>> { >>>> foreach (extent in s->extents) { >>>> bdrv_reopen_commit(extent->reopen_state); >>>> } >>>> free(rs); >>>> } >>>> >>>> void vmdk_reopen_abort() >>>> { >>>> foreach (extent in s->extents) { >>>> bdrv_reopen_abort(extent->reopen_state); >>>> } >>>> free(rs); >>>> } >>> >>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>> &rs)? >> >> No. Closing the old backend would be part of bdrv_reopen_commit. >> >> Do you have a use case where it would be helpful if the caller invoked >> bdrv_close? > > When the caller does bdrv_close() two BlockDriverStates are never open > for the same image file. I thought this was a property we wanted. > > Also, in the block_set_hostcache case we need to reopen without > switching to a new BlockDriverState instance. That means the reopen > needs to be in-place with respect to the BlockDriverState *bs pointer. > We cannot create a new instance. Yes, but where do you even get the second BlockDriverState from? My prototype only returns an int, not a new BlockDriverState. Until bdrv_reopen_commit() it would refer to the old file descriptors etc. and after bdrv_reopen_commit() the very same BlockDriverState would refer to the new ones. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 10:35 ` Kevin Wolf @ 2011-08-09 10:50 ` Stefan Hajnoczi 2011-08-09 10:56 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-09 10:50 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>> >>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>> around and falling back to it should the open fail. >>>>>>>> >>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>> >>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>> in raw-posix. >>>>>>> >>>>>>> I think something like this could do: >>>>>>> >>>>>>> struct BDRVReopenState { >>>>>>> BlockDriverState *bs; >>>>>>> /* can be extended by block drivers */ >>>>>>> }; >>>>>>> >>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>> flags); >>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>> >>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>> one and closes the newly opened one. >>>>>>> >>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>> >>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>> not 100% clear on the idea. >>>>> >>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>> function that does both, but that's syntactic sugar). >>>>> >>>>> For example we would have: >>>>> >>>>> int vmdk_reopen() >>>> >>>> .bdrv_reopen() is a confusing name for this operation because it does >>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>> >>> Makes sense. >>> >>>> >>>>> { >>>>> *((VMDKReopenState**) rs) = malloc(); >>>>> >>>>> foreach (extent in s->extents) { >>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>> if (ret < 0) >>>>> goto fail; >>>>> } >>>>> return 0; >>>>> >>>>> fail: >>>>> foreach (extent in rs->already_reopened) { >>>>> bdrv_reopen_abort(extent->reopen_state); >>>>> } >>>>> return ret; >>>>> } >>>> >>>>> void vmdk_reopen_commit() >>>>> { >>>>> foreach (extent in s->extents) { >>>>> bdrv_reopen_commit(extent->reopen_state); >>>>> } >>>>> free(rs); >>>>> } >>>>> >>>>> void vmdk_reopen_abort() >>>>> { >>>>> foreach (extent in s->extents) { >>>>> bdrv_reopen_abort(extent->reopen_state); >>>>> } >>>>> free(rs); >>>>> } >>>> >>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>> &rs)? >>> >>> No. Closing the old backend would be part of bdrv_reopen_commit. >>> >>> Do you have a use case where it would be helpful if the caller invoked >>> bdrv_close? >> >> When the caller does bdrv_close() two BlockDriverStates are never open >> for the same image file. I thought this was a property we wanted. >> >> Also, in the block_set_hostcache case we need to reopen without >> switching to a new BlockDriverState instance. That means the reopen >> needs to be in-place with respect to the BlockDriverState *bs pointer. >> We cannot create a new instance. > > Yes, but where do you even get the second BlockDriverState from? > > My prototype only returns an int, not a new BlockDriverState. Until > bdrv_reopen_commit() it would refer to the old file descriptors etc. and > after bdrv_reopen_commit() the very same BlockDriverState would refer to > the new ones. It seems I don't understand the API. I thought it was: do_block_set_hostcache() { bdrv_prepare_reopen(bs, &rs); ...open new file and check everything is okay... if (ret == 0) { bdrv_reopen_commit(rs); } else { bdrv_reopen_abort(rs); } return ret; } If the caller isn't opening the new file then what's the point of giving the caller control over prepare, commit, and abort? Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 10:50 ` Stefan Hajnoczi @ 2011-08-09 10:56 ` Stefan Hajnoczi 2011-08-09 11:39 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-09 10:56 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>> >>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>> around and falling back to it should the open fail. >>>>>>>>> >>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>> >>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>> in raw-posix. >>>>>>>> >>>>>>>> I think something like this could do: >>>>>>>> >>>>>>>> struct BDRVReopenState { >>>>>>>> BlockDriverState *bs; >>>>>>>> /* can be extended by block drivers */ >>>>>>>> }; >>>>>>>> >>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>> flags); >>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>> >>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>> one and closes the newly opened one. >>>>>>>> >>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>> >>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>> not 100% clear on the idea. >>>>>> >>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>> function that does both, but that's syntactic sugar). >>>>>> >>>>>> For example we would have: >>>>>> >>>>>> int vmdk_reopen() >>>>> >>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>> >>>> Makes sense. >>>> >>>>> >>>>>> { >>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>> >>>>>> foreach (extent in s->extents) { >>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>> if (ret < 0) >>>>>> goto fail; >>>>>> } >>>>>> return 0; >>>>>> >>>>>> fail: >>>>>> foreach (extent in rs->already_reopened) { >>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>> } >>>>>> return ret; >>>>>> } >>>>> >>>>>> void vmdk_reopen_commit() >>>>>> { >>>>>> foreach (extent in s->extents) { >>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>> } >>>>>> free(rs); >>>>>> } >>>>>> >>>>>> void vmdk_reopen_abort() >>>>>> { >>>>>> foreach (extent in s->extents) { >>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>> } >>>>>> free(rs); >>>>>> } >>>>> >>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>> &rs)? >>>> >>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>> >>>> Do you have a use case where it would be helpful if the caller invoked >>>> bdrv_close? >>> >>> When the caller does bdrv_close() two BlockDriverStates are never open >>> for the same image file. I thought this was a property we wanted. >>> >>> Also, in the block_set_hostcache case we need to reopen without >>> switching to a new BlockDriverState instance. That means the reopen >>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>> We cannot create a new instance. >> >> Yes, but where do you even get the second BlockDriverState from? >> >> My prototype only returns an int, not a new BlockDriverState. Until >> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >> after bdrv_reopen_commit() the very same BlockDriverState would refer to >> the new ones. > > It seems I don't understand the API. I thought it was: > > do_block_set_hostcache() > { > bdrv_prepare_reopen(bs, &rs); > ...open new file and check everything is okay... > if (ret == 0) { > bdrv_reopen_commit(rs); > } else { > bdrv_reopen_abort(rs); > } > return ret; > } > > If the caller isn't opening the new file then what's the point of > giving the caller control over prepare, commit, and abort? After sending the last email I realized what I was missing: You need the prepare, commit, and abort API in order to handle multi-file block drivers like VMDK. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 10:56 ` Stefan Hajnoczi @ 2011-08-09 11:39 ` Kevin Wolf 2011-08-09 12:00 ` Stefan Hajnoczi 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-09 11:39 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: > On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>> >>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>> >>>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>> >>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>>> in raw-posix. >>>>>>>>> >>>>>>>>> I think something like this could do: >>>>>>>>> >>>>>>>>> struct BDRVReopenState { >>>>>>>>> BlockDriverState *bs; >>>>>>>>> /* can be extended by block drivers */ >>>>>>>>> }; >>>>>>>>> >>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>>> flags); >>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>> >>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>>> one and closes the newly opened one. >>>>>>>>> >>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>> >>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>> not 100% clear on the idea. >>>>>>> >>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>>> function that does both, but that's syntactic sugar). >>>>>>> >>>>>>> For example we would have: >>>>>>> >>>>>>> int vmdk_reopen() >>>>>> >>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>> >>>>> Makes sense. >>>>> >>>>>> >>>>>>> { >>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>> >>>>>>> foreach (extent in s->extents) { >>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>> if (ret < 0) >>>>>>> goto fail; >>>>>>> } >>>>>>> return 0; >>>>>>> >>>>>>> fail: >>>>>>> foreach (extent in rs->already_reopened) { >>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>> } >>>>>>> return ret; >>>>>>> } >>>>>> >>>>>>> void vmdk_reopen_commit() >>>>>>> { >>>>>>> foreach (extent in s->extents) { >>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>> } >>>>>>> free(rs); >>>>>>> } >>>>>>> >>>>>>> void vmdk_reopen_abort() >>>>>>> { >>>>>>> foreach (extent in s->extents) { >>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>> } >>>>>>> free(rs); >>>>>>> } >>>>>> >>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>> &rs)? >>>>> >>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>> >>>>> Do you have a use case where it would be helpful if the caller invoked >>>>> bdrv_close? >>>> >>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>> for the same image file. I thought this was a property we wanted. >>>> >>>> Also, in the block_set_hostcache case we need to reopen without >>>> switching to a new BlockDriverState instance. That means the reopen >>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>> We cannot create a new instance. >>> >>> Yes, but where do you even get the second BlockDriverState from? >>> >>> My prototype only returns an int, not a new BlockDriverState. Until >>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>> the new ones. >> >> It seems I don't understand the API. I thought it was: >> >> do_block_set_hostcache() >> { >> bdrv_prepare_reopen(bs, &rs); >> ...open new file and check everything is okay... >> if (ret == 0) { >> bdrv_reopen_commit(rs); >> } else { >> bdrv_reopen_abort(rs); >> } >> return ret; >> } >> >> If the caller isn't opening the new file then what's the point of >> giving the caller control over prepare, commit, and abort? > > After sending the last email I realized what I was missing: > > You need the prepare, commit, and abort API in order to handle > multi-file block drivers like VMDK. Yes, this is whole point of separating commit out. Does the proposal make sense to you now? Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 11:39 ` Kevin Wolf @ 2011-08-09 12:00 ` Stefan Hajnoczi 2011-08-09 12:24 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Stefan Hajnoczi @ 2011-08-09 12:00 UTC (permalink / raw) To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: > Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: > > On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: > >>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: > >>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: > >>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: > >>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: > >>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: > >>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: > >>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: > >>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for > >>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first > >>>>>>>>>> and then opening it again exposes us to the error case where the open fails. > >>>>>>>>>> At that point we cannot get to the file anymore and our options are to > >>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. > >>>>>>>>>> > >>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor > >>>>>>>>>> around and falling back to it should the open fail. > >>>>>>>>>> > >>>>>>>>>> The challenge for the file descriptor approach is that image formats, like > >>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as > >>>>>>>>>> stashing a single file descriptor and reopening from it. > >>>>>>>>> > >>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume > >>>>>>>>> that every backend uses file descriptors. The qemu block layer is based > >>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden > >>>>>>>>> in raw-posix. > >>>>>>>>> > >>>>>>>>> I think something like this could do: > >>>>>>>>> > >>>>>>>>> struct BDRVReopenState { > >>>>>>>>> BlockDriverState *bs; > >>>>>>>>> /* can be extended by block drivers */ > >>>>>>>>> }; > >>>>>>>>> > >>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int > >>>>>>>>> flags); > >>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); > >>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); > >>>>>>>>> > >>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On > >>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old > >>>>>>>>> one and closes the newly opened one. > >>>>>>>>> > >>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in > >>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. > >>>>>>>> > >>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm > >>>>>>>> not 100% clear on the idea. > >>>>>>> > >>>>>>> Well, you wouldn't only call bdrv_reopen, but also either > >>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper > >>>>>>> function that does both, but that's syntactic sugar). > >>>>>>> > >>>>>>> For example we would have: > >>>>>>> > >>>>>>> int vmdk_reopen() > >>>>>> > >>>>>> .bdrv_reopen() is a confusing name for this operation because it does > >>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. > >>>>> > >>>>> Makes sense. > >>>>> > >>>>>> > >>>>>>> { > >>>>>>> *((VMDKReopenState**) rs) = malloc(); > >>>>>>> > >>>>>>> foreach (extent in s->extents) { > >>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) > >>>>>>> if (ret < 0) > >>>>>>> goto fail; > >>>>>>> } > >>>>>>> return 0; > >>>>>>> > >>>>>>> fail: > >>>>>>> foreach (extent in rs->already_reopened) { > >>>>>>> bdrv_reopen_abort(extent->reopen_state); > >>>>>>> } > >>>>>>> return ret; > >>>>>>> } > >>>>>> > >>>>>>> void vmdk_reopen_commit() > >>>>>>> { > >>>>>>> foreach (extent in s->extents) { > >>>>>>> bdrv_reopen_commit(extent->reopen_state); > >>>>>>> } > >>>>>>> free(rs); > >>>>>>> } > >>>>>>> > >>>>>>> void vmdk_reopen_abort() > >>>>>>> { > >>>>>>> foreach (extent in s->extents) { > >>>>>>> bdrv_reopen_abort(extent->reopen_state); > >>>>>>> } > >>>>>>> free(rs); > >>>>>>> } > >>>>>> > >>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, > >>>>>> &rs)? > >>>>> > >>>>> No. Closing the old backend would be part of bdrv_reopen_commit. > >>>>> > >>>>> Do you have a use case where it would be helpful if the caller invoked > >>>>> bdrv_close? > >>>> > >>>> When the caller does bdrv_close() two BlockDriverStates are never open > >>>> for the same image file. I thought this was a property we wanted. > >>>> > >>>> Also, in the block_set_hostcache case we need to reopen without > >>>> switching to a new BlockDriverState instance. That means the reopen > >>>> needs to be in-place with respect to the BlockDriverState *bs pointer. > >>>> We cannot create a new instance. > >>> > >>> Yes, but where do you even get the second BlockDriverState from? > >>> > >>> My prototype only returns an int, not a new BlockDriverState. Until > >>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and > >>> after bdrv_reopen_commit() the very same BlockDriverState would refer to > >>> the new ones. > >> > >> It seems I don't understand the API. I thought it was: > >> > >> do_block_set_hostcache() > >> { > >> bdrv_prepare_reopen(bs, &rs); > >> ...open new file and check everything is okay... > >> if (ret == 0) { > >> bdrv_reopen_commit(rs); > >> } else { > >> bdrv_reopen_abort(rs); > >> } > >> return ret; > >> } > >> > >> If the caller isn't opening the new file then what's the point of > >> giving the caller control over prepare, commit, and abort? > > > > After sending the last email I realized what I was missing: > > > > You need the prepare, commit, and abort API in order to handle > > multi-file block drivers like VMDK. > > Yes, this is whole point of separating commit out. Does the proposal > make sense to you now? It depends on the details. Adding more functions that every BlockDriver must implement is bad, so it's important that we only drop this functionality into raw-posix.c, vmdk.c, and block.c as appropriate. I liked the idea of doing a generic FDStash type that the monitor and bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level takes that further. But if we can do prepare, commit, and abort in a relatively simple way then I'm for it. Stefan ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 12:00 ` Stefan Hajnoczi @ 2011-08-09 12:24 ` Kevin Wolf 2011-08-09 19:39 ` Blue Swirl 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-09 12:24 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: > On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: >> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: >>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>>>> >>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>>>> >>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>>>> >>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>>>>> in raw-posix. >>>>>>>>>>> >>>>>>>>>>> I think something like this could do: >>>>>>>>>>> >>>>>>>>>>> struct BDRVReopenState { >>>>>>>>>>> BlockDriverState *bs; >>>>>>>>>>> /* can be extended by block drivers */ >>>>>>>>>>> }; >>>>>>>>>>> >>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>>>>> flags); >>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>>>> >>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>>>>> one and closes the newly opened one. >>>>>>>>>>> >>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>>>> >>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>>>> not 100% clear on the idea. >>>>>>>>> >>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>>>>> function that does both, but that's syntactic sugar). >>>>>>>>> >>>>>>>>> For example we would have: >>>>>>>>> >>>>>>>>> int vmdk_reopen() >>>>>>>> >>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>>>> >>>>>>> Makes sense. >>>>>>> >>>>>>>> >>>>>>>>> { >>>>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>>>> >>>>>>>>> foreach (extent in s->extents) { >>>>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>>>> if (ret < 0) >>>>>>>>> goto fail; >>>>>>>>> } >>>>>>>>> return 0; >>>>>>>>> >>>>>>>>> fail: >>>>>>>>> foreach (extent in rs->already_reopened) { >>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>> } >>>>>>>>> return ret; >>>>>>>>> } >>>>>>>> >>>>>>>>> void vmdk_reopen_commit() >>>>>>>>> { >>>>>>>>> foreach (extent in s->extents) { >>>>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>>>> } >>>>>>>>> free(rs); >>>>>>>>> } >>>>>>>>> >>>>>>>>> void vmdk_reopen_abort() >>>>>>>>> { >>>>>>>>> foreach (extent in s->extents) { >>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>> } >>>>>>>>> free(rs); >>>>>>>>> } >>>>>>>> >>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>>>> &rs)? >>>>>>> >>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>>>> >>>>>>> Do you have a use case where it would be helpful if the caller invoked >>>>>>> bdrv_close? >>>>>> >>>>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>>>> for the same image file. I thought this was a property we wanted. >>>>>> >>>>>> Also, in the block_set_hostcache case we need to reopen without >>>>>> switching to a new BlockDriverState instance. That means the reopen >>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>>>> We cannot create a new instance. >>>>> >>>>> Yes, but where do you even get the second BlockDriverState from? >>>>> >>>>> My prototype only returns an int, not a new BlockDriverState. Until >>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>>>> the new ones. >>>> >>>> It seems I don't understand the API. I thought it was: >>>> >>>> do_block_set_hostcache() >>>> { >>>> bdrv_prepare_reopen(bs, &rs); >>>> ...open new file and check everything is okay... >>>> if (ret == 0) { >>>> bdrv_reopen_commit(rs); >>>> } else { >>>> bdrv_reopen_abort(rs); >>>> } >>>> return ret; >>>> } >>>> >>>> If the caller isn't opening the new file then what's the point of >>>> giving the caller control over prepare, commit, and abort? >>> >>> After sending the last email I realized what I was missing: >>> >>> You need the prepare, commit, and abort API in order to handle >>> multi-file block drivers like VMDK. >> >> Yes, this is whole point of separating commit out. Does the proposal >> make sense to you now? > > It depends on the details. Adding more functions that every BlockDriver > must implement is bad, so it's important that we only drop this > functionality into raw-posix.c, vmdk.c, and block.c as appropriate. Yes, I agree. > I liked the idea of doing a generic FDStash type that the monitor and > bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level > takes that further. Well, having something that works for the raw-posix, the monitor and maybe some more things is nice. Having something that works for raw-posix, Sheepdog and rbd is an absolute requirement and I can't see how FDStash solves that. Even raw-win32 doesn't have an int fd, but a HANDLE hfile for its backend. So my main concern is that file descriptors are a concept not as generic as it needs to be to suit all block drivers. > But if we can do prepare, commit, and abort in a relatively simple way > then I'm for it. Great, then let's do that. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 12:24 ` Kevin Wolf @ 2011-08-09 19:39 ` Blue Swirl 2011-08-10 7:58 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Blue Swirl @ 2011-08-09 19:39 UTC (permalink / raw) To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: >>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: >>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>>>>> >>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>>>>> >>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>>>>> >>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>>>>>> in raw-posix. >>>>>>>>>>>> >>>>>>>>>>>> I think something like this could do: >>>>>>>>>>>> >>>>>>>>>>>> struct BDRVReopenState { >>>>>>>>>>>> BlockDriverState *bs; >>>>>>>>>>>> /* can be extended by block drivers */ >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>>>>>> flags); >>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>>>>> >>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>>>>>> one and closes the newly opened one. >>>>>>>>>>>> >>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>>>>> >>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>>>>> not 100% clear on the idea. >>>>>>>>>> >>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>>>>>> function that does both, but that's syntactic sugar). >>>>>>>>>> >>>>>>>>>> For example we would have: >>>>>>>>>> >>>>>>>>>> int vmdk_reopen() >>>>>>>>> >>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>>>>> >>>>>>>> Makes sense. >>>>>>>> >>>>>>>>> >>>>>>>>>> { >>>>>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>>>>> >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>>>>> if (ret < 0) >>>>>>>>>> goto fail; >>>>>>>>>> } >>>>>>>>>> return 0; >>>>>>>>>> >>>>>>>>>> fail: >>>>>>>>>> foreach (extent in rs->already_reopened) { >>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> return ret; >>>>>>>>>> } >>>>>>>>> >>>>>>>>>> void vmdk_reopen_commit() >>>>>>>>>> { >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> free(rs); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> void vmdk_reopen_abort() >>>>>>>>>> { >>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>> } >>>>>>>>>> free(rs); >>>>>>>>>> } >>>>>>>>> >>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>>>>> &rs)? >>>>>>>> >>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>>>>> >>>>>>>> Do you have a use case where it would be helpful if the caller invoked >>>>>>>> bdrv_close? >>>>>>> >>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>>>>> for the same image file. I thought this was a property we wanted. >>>>>>> >>>>>>> Also, in the block_set_hostcache case we need to reopen without >>>>>>> switching to a new BlockDriverState instance. That means the reopen >>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>>>>> We cannot create a new instance. >>>>>> >>>>>> Yes, but where do you even get the second BlockDriverState from? >>>>>> >>>>>> My prototype only returns an int, not a new BlockDriverState. Until >>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>>>>> the new ones. >>>>> >>>>> It seems I don't understand the API. I thought it was: >>>>> >>>>> do_block_set_hostcache() >>>>> { >>>>> bdrv_prepare_reopen(bs, &rs); >>>>> ...open new file and check everything is okay... >>>>> if (ret == 0) { >>>>> bdrv_reopen_commit(rs); >>>>> } else { >>>>> bdrv_reopen_abort(rs); >>>>> } >>>>> return ret; >>>>> } >>>>> >>>>> If the caller isn't opening the new file then what's the point of >>>>> giving the caller control over prepare, commit, and abort? >>>> >>>> After sending the last email I realized what I was missing: >>>> >>>> You need the prepare, commit, and abort API in order to handle >>>> multi-file block drivers like VMDK. >>> >>> Yes, this is whole point of separating commit out. Does the proposal >>> make sense to you now? >> >> It depends on the details. Adding more functions that every BlockDriver >> must implement is bad, so it's important that we only drop this >> functionality into raw-posix.c, vmdk.c, and block.c as appropriate. > > Yes, I agree. > >> I liked the idea of doing a generic FDStash type that the monitor and >> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >> takes that further. > > Well, having something that works for the raw-posix, the monitor and > maybe some more things is nice. Having something that works for > raw-posix, Sheepdog and rbd is an absolute requirement and I can't see > how FDStash solves that. For Sheepdog also network access functions would need to be hooked. RBD seems to use librados functions for low level I/O, so that needs some RBD specific wrappers. > Even raw-win32 doesn't have an int fd, but a > HANDLE hfile for its backend. Just replace "int fd" with "FDStash fd" everywhere? > So my main concern is that file descriptors are a concept not as generic > as it needs to be to suit all block drivers. > >> But if we can do prepare, commit, and abort in a relatively simple way >> then I'm for it. > > Great, then let's do that. > > Kevin > > ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-09 19:39 ` Blue Swirl @ 2011-08-10 7:58 ` Kevin Wolf 2011-08-10 17:20 ` Blue Swirl 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-10 7:58 UTC (permalink / raw) To: Blue Swirl; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery Am 09.08.2011 21:39, schrieb Blue Swirl: > On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >>> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: >>>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: >>>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>>>>>> >>>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>>>>>>> in raw-posix. >>>>>>>>>>>>> >>>>>>>>>>>>> I think something like this could do: >>>>>>>>>>>>> >>>>>>>>>>>>> struct BDRVReopenState { >>>>>>>>>>>>> BlockDriverState *bs; >>>>>>>>>>>>> /* can be extended by block drivers */ >>>>>>>>>>>>> }; >>>>>>>>>>>>> >>>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>>>>>>> flags); >>>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>>>>>> >>>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>>>>>>> one and closes the newly opened one. >>>>>>>>>>>>> >>>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>>>>>> >>>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>>>>>> not 100% clear on the idea. >>>>>>>>>>> >>>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>>>>>>> function that does both, but that's syntactic sugar). >>>>>>>>>>> >>>>>>>>>>> For example we would have: >>>>>>>>>>> >>>>>>>>>>> int vmdk_reopen() >>>>>>>>>> >>>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>>>>>> >>>>>>>>> Makes sense. >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> { >>>>>>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>>>>>> >>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>>>>>> if (ret < 0) >>>>>>>>>>> goto fail; >>>>>>>>>>> } >>>>>>>>>>> return 0; >>>>>>>>>>> >>>>>>>>>>> fail: >>>>>>>>>>> foreach (extent in rs->already_reopened) { >>>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>>> } >>>>>>>>>>> return ret; >>>>>>>>>>> } >>>>>>>>>> >>>>>>>>>>> void vmdk_reopen_commit() >>>>>>>>>>> { >>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>>>>>> } >>>>>>>>>>> free(rs); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> void vmdk_reopen_abort() >>>>>>>>>>> { >>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>>> } >>>>>>>>>>> free(rs); >>>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>>>>>> &rs)? >>>>>>>>> >>>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>>>>>> >>>>>>>>> Do you have a use case where it would be helpful if the caller invoked >>>>>>>>> bdrv_close? >>>>>>>> >>>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>>>>>> for the same image file. I thought this was a property we wanted. >>>>>>>> >>>>>>>> Also, in the block_set_hostcache case we need to reopen without >>>>>>>> switching to a new BlockDriverState instance. That means the reopen >>>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>>>>>> We cannot create a new instance. >>>>>>> >>>>>>> Yes, but where do you even get the second BlockDriverState from? >>>>>>> >>>>>>> My prototype only returns an int, not a new BlockDriverState. Until >>>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>>>>>> the new ones. >>>>>> >>>>>> It seems I don't understand the API. I thought it was: >>>>>> >>>>>> do_block_set_hostcache() >>>>>> { >>>>>> bdrv_prepare_reopen(bs, &rs); >>>>>> ...open new file and check everything is okay... >>>>>> if (ret == 0) { >>>>>> bdrv_reopen_commit(rs); >>>>>> } else { >>>>>> bdrv_reopen_abort(rs); >>>>>> } >>>>>> return ret; >>>>>> } >>>>>> >>>>>> If the caller isn't opening the new file then what's the point of >>>>>> giving the caller control over prepare, commit, and abort? >>>>> >>>>> After sending the last email I realized what I was missing: >>>>> >>>>> You need the prepare, commit, and abort API in order to handle >>>>> multi-file block drivers like VMDK. >>>> >>>> Yes, this is whole point of separating commit out. Does the proposal >>>> make sense to you now? >>> >>> It depends on the details. Adding more functions that every BlockDriver >>> must implement is bad, so it's important that we only drop this >>> functionality into raw-posix.c, vmdk.c, and block.c as appropriate. >> >> Yes, I agree. >> >>> I liked the idea of doing a generic FDStash type that the monitor and >>> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >>> takes that further. >> >> Well, having something that works for the raw-posix, the monitor and >> maybe some more things is nice. Having something that works for >> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see >> how FDStash solves that. > > For Sheepdog also network access functions would need to be hooked. > RBD seems to use librados functions for low level I/O, so that needs > some RBD specific wrappers. > >> Even raw-win32 doesn't have an int fd, but a >> HANDLE hfile for its backend. > > Just replace "int fd" with "FDStash fd" everywhere? And how is FDStash defined? The current proposed is this one: /* A stashed file descriptor */ typedef FDEntry { const char *name; int fd; QLIST_ENTRY(FDEntry) next; } FDEntry; /* A container for stashing file descriptors */ typedef struct FDStash { QLIST_HEAD(, FDEntry) fds; } FDStash; So there we have the int fd again. If you want something generic, you would have to start letting block drivers extend FDEntry with their own fields (and how would the first bdrv_open work then?) and basically arrive at something like a BDRVReopenState. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-10 7:58 ` Kevin Wolf @ 2011-08-10 17:20 ` Blue Swirl 2011-08-11 7:37 ` Kevin Wolf 0 siblings, 1 reply; 39+ messages in thread From: Blue Swirl @ 2011-08-10 17:20 UTC (permalink / raw) To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 09.08.2011 21:39, schrieb Blue Swirl: >> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >>>> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote: >>>>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi: >>>>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi: >>>>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi: >>>>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi: >>>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi: >>>>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for >>>>>>>>>>>>>>> changing the hostcache parameter). The problem is that closing the file first >>>>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails. >>>>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to >>>>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor >>>>>>>>>>>>>>> around and falling back to it should the open fail. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like >>>>>>>>>>>>>>> VMDK, can span multiple files. Therefore the solution is not as simple as >>>>>>>>>>>>>>> stashing a single file descriptor and reopening from it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume >>>>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based >>>>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden >>>>>>>>>>>>>> in raw-posix. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think something like this could do: >>>>>>>>>>>>>> >>>>>>>>>>>>>> struct BDRVReopenState { >>>>>>>>>>>>>> BlockDriverState *bs; >>>>>>>>>>>>>> /* can be extended by block drivers */ >>>>>>>>>>>>>> }; >>>>>>>>>>>>>> >>>>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int >>>>>>>>>>>>>> flags); >>>>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state); >>>>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state); >>>>>>>>>>>>>> >>>>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On >>>>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old >>>>>>>>>>>>>> one and closes the newly opened one. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in >>>>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics. >>>>>>>>>>>>> >>>>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces? I'm >>>>>>>>>>>>> not 100% clear on the idea. >>>>>>>>>>>> >>>>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either >>>>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper >>>>>>>>>>>> function that does both, but that's syntactic sugar). >>>>>>>>>>>> >>>>>>>>>>>> For example we would have: >>>>>>>>>>>> >>>>>>>>>>>> int vmdk_reopen() >>>>>>>>>>> >>>>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does >>>>>>>>>>> not reopen anything. bdrv_prepare_reopen() might be clearer. >>>>>>>>>> >>>>>>>>>> Makes sense. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> { >>>>>>>>>>>> *((VMDKReopenState**) rs) = malloc(); >>>>>>>>>>>> >>>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>>> ret = bdrv_reopen(extent->file, &extent->reopen_state) >>>>>>>>>>>> if (ret < 0) >>>>>>>>>>>> goto fail; >>>>>>>>>>>> } >>>>>>>>>>>> return 0; >>>>>>>>>>>> >>>>>>>>>>>> fail: >>>>>>>>>>>> foreach (extent in rs->already_reopened) { >>>>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>>>> } >>>>>>>>>>>> return ret; >>>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>>> void vmdk_reopen_commit() >>>>>>>>>>>> { >>>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>>> bdrv_reopen_commit(extent->reopen_state); >>>>>>>>>>>> } >>>>>>>>>>>> free(rs); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> void vmdk_reopen_abort() >>>>>>>>>>>> { >>>>>>>>>>>> foreach (extent in s->extents) { >>>>>>>>>>>> bdrv_reopen_abort(extent->reopen_state); >>>>>>>>>>>> } >>>>>>>>>>>> free(rs); >>>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs, >>>>>>>>>>> &rs)? >>>>>>>>>> >>>>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit. >>>>>>>>>> >>>>>>>>>> Do you have a use case where it would be helpful if the caller invoked >>>>>>>>>> bdrv_close? >>>>>>>>> >>>>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open >>>>>>>>> for the same image file. I thought this was a property we wanted. >>>>>>>>> >>>>>>>>> Also, in the block_set_hostcache case we need to reopen without >>>>>>>>> switching to a new BlockDriverState instance. That means the reopen >>>>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer. >>>>>>>>> We cannot create a new instance. >>>>>>>> >>>>>>>> Yes, but where do you even get the second BlockDriverState from? >>>>>>>> >>>>>>>> My prototype only returns an int, not a new BlockDriverState. Until >>>>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and >>>>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to >>>>>>>> the new ones. >>>>>>> >>>>>>> It seems I don't understand the API. I thought it was: >>>>>>> >>>>>>> do_block_set_hostcache() >>>>>>> { >>>>>>> bdrv_prepare_reopen(bs, &rs); >>>>>>> ...open new file and check everything is okay... >>>>>>> if (ret == 0) { >>>>>>> bdrv_reopen_commit(rs); >>>>>>> } else { >>>>>>> bdrv_reopen_abort(rs); >>>>>>> } >>>>>>> return ret; >>>>>>> } >>>>>>> >>>>>>> If the caller isn't opening the new file then what's the point of >>>>>>> giving the caller control over prepare, commit, and abort? >>>>>> >>>>>> After sending the last email I realized what I was missing: >>>>>> >>>>>> You need the prepare, commit, and abort API in order to handle >>>>>> multi-file block drivers like VMDK. >>>>> >>>>> Yes, this is whole point of separating commit out. Does the proposal >>>>> make sense to you now? >>>> >>>> It depends on the details. Adding more functions that every BlockDriver >>>> must implement is bad, so it's important that we only drop this >>>> functionality into raw-posix.c, vmdk.c, and block.c as appropriate. >>> >>> Yes, I agree. >>> >>>> I liked the idea of doing a generic FDStash type that the monitor and >>>> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >>>> takes that further. >>> >>> Well, having something that works for the raw-posix, the monitor and >>> maybe some more things is nice. Having something that works for >>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see >>> how FDStash solves that. >> >> For Sheepdog also network access functions would need to be hooked. >> RBD seems to use librados functions for low level I/O, so that needs >> some RBD specific wrappers. >> >>> Even raw-win32 doesn't have an int fd, but a >>> HANDLE hfile for its backend. >> >> Just replace "int fd" with "FDStash fd" everywhere? > > And how is FDStash defined? The current proposed is this one: > > /* A stashed file descriptor */ > typedef FDEntry { > const char *name; > int fd; > QLIST_ENTRY(FDEntry) next; > } FDEntry; > > /* A container for stashing file descriptors */ > typedef struct FDStash { > QLIST_HEAD(, FDEntry) fds; > } FDStash; > > So there we have the int fd again. If you want something generic, you What's the problem: typedef struct FDEntry { const char *name; #ifdef _WIN32 HANDLE fd; #else int fd; #endif QLIST_ENTRY(FDEntry) next; } FDEntry; Renaming 'fd' to something that does not immediately look like 'int' may be useful. > would have to start letting block drivers extend FDEntry with their own > fields (and how would the first bdrv_open work then?) and basically > arrive at something like a BDRVReopenState. I'd not handle FDEntries at block level at all (or only at raw level), then there would not be first mover problems. Extending the uses of FDEntries to for example network stuff (Sheepdog) could need some kind of unified API for both file and net operations which is not what we have now (bdrv_read/write etc. vs. direct recv/send). Then this new API would use FDEntries instead of fds, sockets or HANDLEs. The 'name' field could be either network address or file path. Though maybe we are only interested in open/connect part so unification would not be needed for other operations. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-10 17:20 ` Blue Swirl @ 2011-08-11 7:37 ` Kevin Wolf 2011-08-11 16:21 ` Blue Swirl 0 siblings, 1 reply; 39+ messages in thread From: Kevin Wolf @ 2011-08-11 7:37 UTC (permalink / raw) To: Blue Swirl; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery Am 10.08.2011 19:20, schrieb Blue Swirl: > On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 09.08.2011 21:39, schrieb Blue Swirl: >>> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >>>>> I liked the idea of doing a generic FDStash type that the monitor and >>>>> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >>>>> takes that further. >>>> >>>> Well, having something that works for the raw-posix, the monitor and >>>> maybe some more things is nice. Having something that works for >>>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see >>>> how FDStash solves that. >>> >>> For Sheepdog also network access functions would need to be hooked. >>> RBD seems to use librados functions for low level I/O, so that needs >>> some RBD specific wrappers. >>> >>>> Even raw-win32 doesn't have an int fd, but a >>>> HANDLE hfile for its backend. >>> >>> Just replace "int fd" with "FDStash fd" everywhere? >> >> And how is FDStash defined? The current proposed is this one: >> >> /* A stashed file descriptor */ >> typedef FDEntry { >> const char *name; >> int fd; >> QLIST_ENTRY(FDEntry) next; >> } FDEntry; >> >> /* A container for stashing file descriptors */ >> typedef struct FDStash { >> QLIST_HEAD(, FDEntry) fds; >> } FDStash; >> >> So there we have the int fd again. If you want something generic, you > > What's the problem: > typedef struct FDEntry { > const char *name; > #ifdef _WIN32 > HANDLE fd; > #else > int fd; > #endif > QLIST_ENTRY(FDEntry) next; > } FDEntry; > > Renaming 'fd' to something that does not immediately look like 'int' > may be useful. This isn't any more generic, it merely covers two special cases instead of only one. You have used the fact that raw-posix and raw-win32 are never compiled in at the same time, so that you can use an #ifdef to conditionally pull out parts of their internal data structures into a generic struct (which in itself is a layering violation) It doesn't help with the other backend drivers like curl, nbd, rbd, sheepdog and last but not least our special friend vvfat. >> would have to start letting block drivers extend FDEntry with their own >> fields (and how would the first bdrv_open work then?) and basically >> arrive at something like a BDRVReopenState. > > I'd not handle FDEntries at block level at all (or only at raw level), > then there would not be first mover problems. Well, but how does it solve the bdrv_reopen problem if it's not visible on the block level? > Extending the uses of FDEntries to for example network stuff > (Sheepdog) could need some kind of unified API for both file and net > operations which is not what we have now (bdrv_read/write etc. vs. > direct recv/send). Then this new API would use FDEntries instead of > fds, sockets or HANDLEs. The 'name' field could be either network > address or file path. Though maybe we are only interested in > open/connect part so unification would not be needed for other > operations. Now you have handled three special cases, but still haven't got a generic solution. But considering that you don't want to touch the block layer API and therefore I don't even have an idea of what you would use FDEntries for... Let's go back one step: What problem are you trying to solve, and what does the API look like that you're thinking of? It doesn't seem to be the same as Stefan suggested. Kevin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-11 7:37 ` Kevin Wolf @ 2011-08-11 16:21 ` Blue Swirl 0 siblings, 0 replies; 39+ messages in thread From: Blue Swirl @ 2011-08-11 16:21 UTC (permalink / raw) To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery On Thu, Aug 11, 2011 at 7:37 AM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 10.08.2011 19:20, schrieb Blue Swirl: >> On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 09.08.2011 21:39, schrieb Blue Swirl: >>>> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi: >>>>>> I liked the idea of doing a generic FDStash type that the monitor and >>>>>> bdrv_reopen() can use. Blue's idea to hook at the qemu_open() level >>>>>> takes that further. >>>>> >>>>> Well, having something that works for the raw-posix, the monitor and >>>>> maybe some more things is nice. Having something that works for >>>>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see >>>>> how FDStash solves that. >>>> >>>> For Sheepdog also network access functions would need to be hooked. >>>> RBD seems to use librados functions for low level I/O, so that needs >>>> some RBD specific wrappers. >>>> >>>>> Even raw-win32 doesn't have an int fd, but a >>>>> HANDLE hfile for its backend. >>>> >>>> Just replace "int fd" with "FDStash fd" everywhere? >>> >>> And how is FDStash defined? The current proposed is this one: >>> >>> /* A stashed file descriptor */ >>> typedef FDEntry { >>> const char *name; >>> int fd; >>> QLIST_ENTRY(FDEntry) next; >>> } FDEntry; >>> >>> /* A container for stashing file descriptors */ >>> typedef struct FDStash { >>> QLIST_HEAD(, FDEntry) fds; >>> } FDStash; >>> >>> So there we have the int fd again. If you want something generic, you >> >> What's the problem: >> typedef struct FDEntry { >> const char *name; >> #ifdef _WIN32 >> HANDLE fd; >> #else >> int fd; >> #endif >> QLIST_ENTRY(FDEntry) next; >> } FDEntry; >> >> Renaming 'fd' to something that does not immediately look like 'int' >> may be useful. > > This isn't any more generic, it merely covers two special cases instead > of only one. > > You have used the fact that raw-posix and raw-win32 are never compiled > in at the same time, so that you can use an #ifdef to conditionally pull > out parts of their internal data structures into a generic struct (which > in itself is a layering violation) > > It doesn't help with the other backend drivers like curl, nbd, rbd, > sheepdog and last but not least our special friend vvfat. Probably vvfat would not need any FDEntry. Curl has the same problem as RBD that it's not possible to inject a file descriptor. >>> would have to start letting block drivers extend FDEntry with their own >>> fields (and how would the first bdrv_open work then?) and basically >>> arrive at something like a BDRVReopenState. >> >> I'd not handle FDEntries at block level at all (or only at raw level), >> then there would not be first mover problems. > > Well, but how does it solve the bdrv_reopen problem if it's not visible > on the block level? > >> Extending the uses of FDEntries to for example network stuff >> (Sheepdog) could need some kind of unified API for both file and net >> operations which is not what we have now (bdrv_read/write etc. vs. >> direct recv/send). Then this new API would use FDEntries instead of >> fds, sockets or HANDLEs. The 'name' field could be either network >> address or file path. Though maybe we are only interested in >> open/connect part so unification would not be needed for other >> operations. > > Now you have handled three special cases, but still haven't got a > generic solution. > > But considering that you don't want to touch the block layer API and > therefore I don't even have an idea of what you would use FDEntries > for... Let's go back one step: What problem are you trying to solve, and > what does the API look like that you're thinking of? It doesn't seem to > be the same as Stefan suggested. Excellent point. I think I was thinking of solving SELinux/NFS and privilege separation problems with this one. From this one step back perspective those seem to be slightly different issues. For this problem, BDRVReopenState could be OK. FDStash may be useful otherwise. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Qemu-devel] Safely reopening image files by stashing fds 2011-08-05 8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi 2011-08-05 9:04 ` Paolo Bonzini 2011-08-05 9:07 ` Kevin Wolf @ 2011-08-05 20:16 ` Blue Swirl 2 siblings, 0 replies; 39+ messages in thread From: Blue Swirl @ 2011-08-05 20:16 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel On Fri, Aug 5, 2011 at 8:40 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > We've discussed safe methods for reopening image files (e.g. useful for > changing the hostcache parameter). The problem is that closing the file first > and then opening it again exposes us to the error case where the open fails. > At that point we cannot get to the file anymore and our options are to > terminate QEMU, pause the VM, or offline the block device. > > This window of vulnerability can be eliminated by keeping the file descriptor > around and falling back to it should the open fail. > > The challenge for the file descriptor approach is that image formats, like > VMDK, can span multiple files. Therefore the solution is not as simple as > stashing a single file descriptor and reopening from it. > > Here is the outline for an fd stashing mechanism that can handle reopening > multi-file images and could also solve the file descriptor passing problem for > libvirt: > > 1. Extract monitor getfd/closefd functionality > > The monitor already supports fd stashing with getfd/closefd commands. But the > fd stash code is part of Monitor and we need to extract it into its own object. > > /* A stashed file descriptor */ > typedef FDEntry { > const char *name; > int fd; > QLIST_ENTRY(FDEntry) next; > } FDEntry; > > /* A container for stashing file descriptors */ > typedef struct FDStash { > QLIST_HEAD(, FDEntry) fds; > } FDStash; > > void fdstash_init(FDStash *stash); > > /** > * Clear stashed file descriptors and close them > */ > void fdstash_cleanup(FDStash *stash); > > /** > * Stash a file descriptor and give up ownership > * > * If a file descriptor is already present with the same name the old fd is > * closed and replaced by the new one. > */ > void fdstash_give(FDStash *stash, const char *name, int fd); > > /** > * Find and take ownership of a stashed file descriptor > * > * Return the file descriptor or -ENOENT if not found. > */ > int fdstash_take(FDStash *stash, const char *name); > > The monitor is refactored to use this code instead of open coding fd stashing. > > 2. Introduce a function to extract open file descriptors from an block device > > Add a new .bdrv_extract_fds(BlockDriverState *bs, FDStash *stash) interface, > which defaults to calling bdrv_extract_fds(bs->file, stash). > > VMDK and protocols can implement this function to support extracting open fds > from a block device. Note that they need to dup(2) fds before giving them to > the fdstash, otherwise the fd will be closed when the block device is > closed/deleted. > > 3. Rework bdrv_open() to take a FDStash > > Check the FDStash before opening an image file on the host file system. This > makes it possible to open an image file and use existing stashed fds. > > 4. Implement bdrv_reopen() > > First call bdrv_extract_fds() to stash the file descriptors, then close the > block device. Try opening the new image but if that fails, reopen using the > stashed file descriptors. > > Thoughts? I was previously thinking that the fd store should be tightly coupled with block layer, but maybe it would be better to make this totally generic like Avi(?) proposed, then all files (logs etc) could benefit from this functionality. I'd then handle fd stashing in qemu_open() etc. Stashing would not need major changes anywhere, except reopen should be handled with qemu_reopen() instead of qemu_close() + qemu_open() sequence. From the other side (monitor), the interface could be like you present in 1. ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2011-10-11 5:22 UTC | newest] Thread overview: 39+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-08-05 8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi 2011-08-05 9:04 ` Paolo Bonzini 2011-08-05 9:27 ` Stefan Hajnoczi 2011-08-05 9:55 ` Paolo Bonzini 2011-08-05 13:03 ` Stefan Hajnoczi 2011-08-05 13:12 ` Daniel P. Berrange 2011-08-05 14:28 ` Christoph Hellwig 2011-08-05 15:24 ` Stefan Hajnoczi 2011-08-05 15:43 ` Kevin Wolf 2011-08-05 15:49 ` Anthony Liguori 2011-08-08 7:02 ` Supriya Kannery 2011-08-08 8:12 ` Kevin Wolf 2011-08-09 9:22 ` supriya kannery 2011-08-09 9:51 ` Kevin Wolf 2011-08-09 9:32 ` supriya kannery 2011-08-16 19:18 ` [Qemu-devel] [RFC] " Supriya Kannery 2011-08-16 19:18 ` Supriya Kannery 2011-08-17 14:35 ` Kevin Wolf 2011-10-10 18:28 ` [Qemu-devel] " Kevin Wolf 2011-10-11 5:21 ` Supriya Kannery 2011-08-05 14:27 ` Christoph Hellwig 2011-08-05 9:07 ` Kevin Wolf 2011-08-05 9:29 ` Stefan Hajnoczi 2011-08-05 9:48 ` Kevin Wolf 2011-08-08 14:49 ` Stefan Hajnoczi 2011-08-08 15:16 ` Kevin Wolf 2011-08-09 10:25 ` Stefan Hajnoczi 2011-08-09 10:35 ` Kevin Wolf 2011-08-09 10:50 ` Stefan Hajnoczi 2011-08-09 10:56 ` Stefan Hajnoczi 2011-08-09 11:39 ` Kevin Wolf 2011-08-09 12:00 ` Stefan Hajnoczi 2011-08-09 12:24 ` Kevin Wolf 2011-08-09 19:39 ` Blue Swirl 2011-08-10 7:58 ` Kevin Wolf 2011-08-10 17:20 ` Blue Swirl 2011-08-11 7:37 ` Kevin Wolf 2011-08-11 16:21 ` Blue Swirl 2011-08-05 20:16 ` Blue Swirl
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).