From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4E317C24.3000102@linux.vnet.ibm.com>
Date: Thu, 28 Jul 2011 10:11:32 -0500
From: Michael Roth
References: <20110727152457.GK18528@redhat.com> <1311821631.9256.11.camel@nexus.oss.ntt.co.jp> <20110728080313.GE3087@redhat.com>
In-Reply-To: <20110728080313.GE3087@redhat.com>
Subject: Re: [Qemu-devel] RFC: moving fsfreeze support from the userland guest agent to the guest kernel
To: Andrea Arcangeli
Cc: Jes Sorensen, Fernando Luis Vázquez Cao, qemu-devel@nongnu.org, Luiz Capitulino

On 07/28/2011 03:03 AM, Andrea Arcangeli wrote:
> On Thu, Jul 28, 2011 at 11:53:50AM +0900, Fernando Luis Vázquez Cao wrote:
>> On Wed,
>> 2011-07-27 at 17:24 +0200, Andrea Arcangeli wrote:
>>> making sure no lib is calling any I/O function to be able to defreeze the
>>> filesystems later, making sure the oom killer or a wrong kill -9
>>> $RANDOM isn't killing the agent by mistake while the I/O is blocked
>>> and the copy is going.
>>
>> Yes, with the current API, if the agent is killed while the filesystems
>> are frozen we are screwed.
>>
>> I have just submitted patches that implement a new API that should make
>> the virtualization use case more reliable. Basically, I am adding a new
>> ioctl, FIGETFREEZEFD, which freezes the indicated filesystem and returns
>> a file descriptor; as long as that file descriptor is held open, the
>> filesystem remains frozen. If the freeze file descriptor is closed (be it
>> through an explicit call to close(2) or as part of process exit
>> housekeeping), the associated filesystem is automatically thawed.
>>
>> - fsfreeze: add ioctl to create a fd for freeze control
>>   http://marc.info/?l=linux-fsdevel&m=131175212512290&w=2
>> - fsfreeze: add freeze fd ioctls
>>   http://marc.info/?l=linux-fsdevel&m=131175220612341&w=2
>
> This is probably how the API should have been implemented originally
> instead of FIFREEZE/FITHAW.
>
> It looks a bit overkill though. I would think it'd be enough to have
> the fsfreeze forced at FIGETFREEZEFD, and the only way to thaw be by
> closing the file, without requiring any of
> FS_FREEZE_FD/FS_THAW_FD/FS_ISFROZEN_FD. But I guess you have use cases

One of the crappy things about the current implementation is the
inability to determine whether or not a filesystem is frozen. At least
in the context of the guest agent, it'd be nice if
guest-fsfreeze-status checked the actual system state rather than some
internal state that may not necessarily reflect reality (if we freeze,
and some other application thaws, we currently still report the state as
frozen).
Also in the context of the guest agent: we are indeed screwed if the
agent gets killed while in a frozen state, and we remain screwed even if
it's restarted, since we have no way of determining whether or not we're
in a frozen state and thus should disable logging operations.

We could check status by looking for a failure from the freeze
operation, but if you're just interested in getting the state, having to
potentially induce a freeze just to get at the state is really heavy-handed.

So having an open operation that doesn't force a freeze/thaw/status
operation serves some fairly common use cases, I think.

> for those if you implemented it, maybe to check if root is stepping on
> its own toes by checking if the fs is already frozen before freezing
> it and returning failure if it is; running an ioctl instead of opening and
> closing the file isn't necessarily better. At the very least the
> get_user(should_freeze, argp) doesn't seem so necessary, it just
> complicates the ioctl API a bit without much gain. I think it'd be
> cleaner if FS_FREEZE_FD was the only way to freeze then.
>
> It's certainly a nice reliability improvement and safer API.
>
> Now if you add a file descriptor to epoll/poll that userland can open
> and talk to, to know when a fsfreeze is asked for on a certain fs, a
> fsfreeze userland agent (not virt related either) could open it and start
> the scripts if that filesystem is being fsfrozen before calling
> freeze_super().
>
> Then a PARAVIRT_FSFREEZE=y/m driver could just invoke the fsfreeze
> without any dependency on a virt-specific guest agent.
>
> Maybe Christoph's right that there are filesystems in userland (not sure
> how the storage is related, it's all about filesystems and apps as far
> as I can see, and it's all blkdev agnostic) that may make things more
> complicated, but those usually have a kernel backend too (like
> fuse).
> I may not see the full picture of the filesystems in userland or
> how the storage agent in guest userland relates to this.
>
> If you believe having libvirt talking QMP/QAPI over a virtio-serial
> vmchannel with some virt-specific guest userland agent, bypassing qemu
> entirely, is better, that's ok with me, but there should be a strong
> reason for it, because the paravirt_fsfreeze.ko approach with a small
> qemu backend and a qemu monitor command that starts paravirt-fsfreeze
> in the guest before going ahead blocking all I/O (to provide backwards
> compatibility and reliable snapshots for guest OSes that won't have the
> paravirt fsfreeze too) looks more reliable, more compact and simpler
> to use to me. I'll surely be ok either way though.
>
> Thanks,
> Andrea