From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
    id 1M1Jpl-00079K-Rw
    for qemu-devel@nongnu.org; Tue, 05 May 2009 08:33:17 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
    id 1M1Jph-00076K-Pc
    for qemu-devel@nongnu.org; Tue, 05 May 2009 08:33:17 -0400
Received: from [199.232.76.173] (port=52350 helo=monty-python.gnu.org)
    by lists.gnu.org with esmtp (Exim 4.43) id 1M1Jph-00075u-9C
    for qemu-devel@nongnu.org; Tue, 05 May 2009 08:33:13 -0400
Received: from mail2.shareable.org ([80.68.89.115]:53050)
    by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
    (Exim 4.60) (envelope-from ) id 1M1Jpg-0006yz-Mz
    for qemu-devel@nongnu.org; Tue, 05 May 2009 08:33:13 -0400
Date: Tue, 5 May 2009 13:33:11 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH 2/3] barriers: block-raw-posix barrier support
Message-ID: <20090505123311.GD25328@shareable.org>
References: <20090505120804.GA30651@lst.de> <20090505120836.GB30721@lst.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090505120836.GB30721@lst.de>
List-Id: qemu-devel.nongnu.org
To: Christoph Hellwig
Cc: qemu-devel@nongnu.org

Christoph Hellwig wrote:
> Add support for write barriers to the posix raw file / block device code.
> The guts of this is in the aio emulation as that's where we handle our queue
> of outstanding requests.

It's nice to see this :-)

IDE and SCSI's cache flush commands should map to it nicely too.

> The highlevel design is the following:
>
>  - As soon as a barrier request is submitted via qemu_paio_submit we
>    increment the barrier_inprogress count to signal we now have to
>    deal with barriers.
>  - From that point on every new request that is queued up by
>    qemu_paio_submit does not get onto the normal request list but a
>    secondary post-barrier queue
>
>  - Once the barrier request is dequeued by an aio_thread that thread waits for
>    all other outstanding requests to finish, issues an fdatasync, the actual
>    barrier request, another fdatasync to prevent reordering in the pagecache.

You don't need two fdatasyncs if the barrier request is just a
barrier, with no data write, used only to flush previously written
data on behalf of a guest's fsync/fdatasync implementation.

>    After the request is finished the barrier_inprogress counter is decrement,
>    the post-barrier list is splice back onto the main request list up to and
>    including the next barrier request if there is one and normal operation
>    is resumed.
>
> That means barrier mean a quite massive serialization of the I/O submission
> path, which unfortunately is not avoidable given their semantics.

This is the best argument yet for having distinct "barrier" and "sync"
operations.  "Barrier" is for ordering I/O, as journalling filesystems
need.  "Sync" is to be sent after a guest fsync/fdatasync, to commit
the data written so far to storage.  It waits for the data to be
committed, and also asks for the data to be written sooner.

"Sync" operations don't need to serialise I/O as much: it's ok to
initiate later writes in parallel, and this is enough to keep the
storage busy when there's a steady stream of guest fsyncs.

Both together, "Barrier|Sync" would do what you've implemented: force
ordering, write data quickly, and wait until it's committed to hard
storage.

Although Linux doesn't separate these two concepts (yet), the I/O
serialisation means it might make a measurable difference to
fsync-heavy workloads for virtio to have two separate bits, one for
each concept, and then to add the necessary tweaks to guest kernels so
they use only one or both bits as needed.

-- Jamie
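
P.S. To make the two-bit idea concrete, here is a rough sketch of the
guest-visible semantics only.  The flag names (REQ_BARRIER, REQ_SYNC),
the struct and the function are all hypothetical; nothing like this
exists in virtio or QEMU today, and it says nothing about how many
fdatasyncs the host needs underneath.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-request flag bits, one for each concept. */
enum {
    REQ_BARRIER = 1 << 0,  /* order this request against other I/O */
    REQ_SYNC    = 1 << 1   /* commit data written so far to storage */
};

/* What a request with the given flags asks of the host (assumed model). */
struct req_semantics {
    bool serialises_submission;  /* later requests must be held back */
    bool waits_for_commit;       /* completes only once data is committed */
};

static struct req_semantics req_semantics_for(unsigned flags)
{
    /* Only the two hypothetical bits are defined in this sketch. */
    assert((flags & ~(unsigned)(REQ_BARRIER | REQ_SYNC)) == 0);

    struct req_semantics s = { false, false };

    if (flags & REQ_BARRIER)
        s.serialises_submission = true;  /* ordering costs parallelism */
    if (flags & REQ_SYNC)
        s.waits_for_commit = true;       /* later writes may still start */

    return s;
}
```

With this model a plain REQ_SYNC leaves submission unserialised, and
REQ_BARRIER | REQ_SYNC corresponds to what the patch implements.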