Date: Tue, 4 May 2010 21:17:05 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Message-ID: <20100504201705.GA4360@shareable.org>
References: <20100218222220.GA14847@redhat.com> <201005041408.25069.rusty@rustcorp.com.au> <20100504084133.GH27497@kernel.dk>
In-Reply-To: <20100504084133.GH27497@kernel.dk>
List-Id: qemu-devel.nongnu.org
To: Jens Axboe
Cc: tytso@mit.edu, kvm@vger.kernel.org, "Michael S. Tsirkin", Neil Brown, Rusty Russell, qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org, hch@lst.de

Jens Axboe wrote:
> On Tue, May 04 2010, Rusty Russell wrote:
> > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > usual I/O suspects...
>
> It would be nice to have a fuller API for this, but the reality is
> that only the flush approach is really workable. Even just strict
> ordering of requests could only be supported on SCSI, and even there
> the kernel still lacks proper guarantees on error handling to prevent
> reordering.
There are a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not
   before a BARRIER. That might be useful for time-critical WRITEs,
   and for those issued at high I/O priority.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
   only for data belonging to a particular file (e.g. fdatasync with
   no file size change, even on btrfs if O_DIRECT was used for the
   writes being committed). That would entail tagging FLUSHes and
   WRITEs with an fs-specific identifier (such as an inode number),
   opaque to the scheduler, which only checks for equality.

3. By delaying FLUSHes through reordering as above, the I/O scheduler
   could merge multiple FLUSHes into a single command.

4. On MD/RAID, BARRIER requires every backing device to quiesce before
   the low-level cache-flush is sent, and all of those to finish
   before each backing device resumes. FLUSH doesn't require as much
   synchronising. (With per-file FLUSH, as in point 2, it could even
   skip the FLUSH altogether for some backing devices when the file
   is small.)

In other words, FLUSH can be more relaxed than BARRIER inside the
kernel. It's ironic that we think of fsync as stronger than fbarrier
outside the kernel :-)

-- Jamie
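To make the reordering freedoms in points 1-3 concrete, here is a toy model. This is nothing like the real elevator code: `schedule()`, the request tuples, and the tag convention are all invented for illustration. It models three rules from the discussion above: a WRITE may hop over a FLUSH carrying a different tag but never over a BARRIER (or a FLUSH for the same file), and FLUSHes that end up adjacent with equal tags merge into one command.

```python
# Hypothetical sketch (not kernel code): models the reordering
# freedoms described above. A FLUSH carries an opaque tag (e.g. an
# inode number); the scheduler only compares tags for equality.

def schedule(queue):
    """Reorder and merge a request queue under the rules above.

    Each request is a tuple: ("WRITE", tag), ("FLUSH", tag),
    or ("BARRIER", None).
    """
    out = []
    for req in queue:
        kind, tag = req
        if kind == "WRITE":
            # A WRITE may hop over FLUSHes tagged for other files;
            # a BARRIER, or a FLUSH for the same file, stops it.
            i = len(out)
            while i > 0 and out[i - 1][0] == "FLUSH" and out[i - 1][1] != tag:
                i -= 1
            out.insert(i, req)
        elif kind == "FLUSH":
            # Merge with an immediately preceding FLUSH of the same tag
            # (point 3: delayed FLUSHes coalesce into one command).
            if out and out[-1] == ("FLUSH", tag):
                continue
            out.append(req)
        else:  # BARRIER: nothing moves across it, nothing merges with it
            out.append(req)
    return out
```

Note how moving WRITEs out of the way is what makes the merging in point 3 kick in: in a queue like FLUSH(file 1), WRITE(file 2), FLUSH(file 1), the WRITE hops forward and the two FLUSHes become adjacent and collapse into one.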
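For the userspace side of point 2, this is the pattern that would benefit: an in-place overwrite (no size change) committed with fdatasync, which needs no metadata flush and which a per-file-tagged FLUSH could commit without draining unrelated files. The helper name is made up; os.pwrite and os.fdatasync are the real POSIX calls. Opening with O_DIRECT (and suitably aligned buffers) would sharpen the effect on btrfs, per the discussion above.

```python
import os

def overwrite_and_commit(path, offset, data):
    """Overwrite existing bytes at `offset`, without changing the file
    size, then force just this file's data to stable storage."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)   # in-place write, size unchanged
        os.fdatasync(fd)              # data-only commit: no size/metadata
                                      # update needs flushing
    finally:
        os.close(fd)
```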