From: Rusty Russell <rusty@rustcorp.com.au>
To: Jamie Lokier
Cc: tytso@mit.edu, kvm@vger.kernel.org, "Michael S. Tsirkin",
	Neil Brown, qemu-devel@nongnu.org,
	virtualization@lists.linux-foundation.org, Jens Axboe, hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Wed, 5 May 2010 14:28:41 +0930
Message-Id: <201005051428.41735.rusty@rustcorp.com.au>
In-Reply-To: <20100504201705.GA4360@shareable.org>
References: <20100218222220.GA14847@redhat.com>
	<20100504084133.GH27497@kernel.dk>
	<20100504201705.GA4360@shareable.org>

On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> >
> > It would be nice to have a more fuller API for this, but the reality is
> > that only the flush approach is really workable.  Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "<A> must occur
before <B>".  That offers maximum choice to the I/O subsystem and
potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with "all prior writes to
this struct file before all future writes", even if the original
guarantees were stronger (i.e. on an inode basis).  We currently
implement transactions using 4 fsync/msync pairs:

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);

	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);

	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);

	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually
hitting disk before returning.
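For concreteness, here's a minimal, self-contained sketch of that
four-phase commit.  The commit()/sync_both() helpers and the
RECOVERY_OFF/DATA_OFF offsets are made up for illustration (they stand
in for write_recovery_data() and friends), and it assumes the file is
updated both through the fd and through a MAP_SHARED mapping.  Note
that every phase pays for a full flush even though all we need is
"phase N is ordered before phase N+1":

	/*
	 * Illustrative only: a 4-phase transactional overwrite using
	 * fsync()+msync() as the ordering primitive.  Offsets and
	 * helper names are hypothetical.
	 */
	#include <stdint.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define RECOVERY_OFF	4096	/* where the undo copy lives */
	#define DATA_OFF	8192	/* where the live data lives */

	/* Force out both the write()-path and the mmap-path data. */
	static int sync_both(int fd, void *map, size_t maplen)
	{
		if (fsync(fd) != 0)
			return -1;
		return msync(map, maplen, MS_SYNC);
	}

	int commit(int fd, void *map, size_t maplen,
		   const void *newdata, size_t datalen)
	{
		uint64_t valid = 1, invalid = 0;

		/* 1. write_recovery_data: save the old contents. */
		if (pwrite(fd, (char *)map + DATA_OFF, datalen,
			   RECOVERY_OFF) != (ssize_t)datalen)
			return -1;
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 2. write_recovery_header: mark the undo copy valid. */
		if (pwrite(fd, &valid, sizeof(valid),
			   RECOVERY_OFF - sizeof(valid)) != (ssize_t)sizeof(valid))
			return -1;
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 3. overwrite_with_new_data: update the live data in place. */
		memcpy((char *)map + DATA_OFF, newdata, datalen);
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 4. remove_recovery_header: the transaction is committed. */
		if (pwrite(fd, &invalid, sizeof(invalid),
			   RECOVERY_OFF - sizeof(invalid)) != (ssize_t)sizeof(invalid))
			return -1;
		return sync_both(fd, map, maplen);
	}

With a hypothetical ordering-only call (an fbarrier()-style primitive),
steps 1-3 could use that instead, and only the final step, if even
that, would need a real flush.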
> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel.  It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it
has less information about what is required.  I'm saying I want to give
you as much information as I can, even if you don't use it yet.

Thanks,
Rusty.