From: Jamie Lokier
Subject: Re: barriers vs. reads - O_DIRECT
Date: Thu, 24 Jun 2004 23:42:59 +0100
To: Werner Almesberger
Cc: Bryan Henderson, linux-fsdevel@vger.kernel.org
Message-ID: <20040624224259.GA12840@mail.shareable.org>
In-Reply-To: <20040624175516.W1325@almesberger.net>

Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)

For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.
For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.

(When data is not journalled, the situation can be more complicated:
to avoid revealing stale data, you might require non-journal data to
be committed before allowing a journal write which increases the file
length or block-mapping metadata. So then you have non-journal writes
before journal writes before other non-journal writes. I'm not sure
if ext3 or reiserfs do this.)

You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can
imagine many fs updates aggregated into a larger transaction. That'd
aggregate into fewer barriers.

Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- reducing the number of I/O barriers which
reach the device. That's a simple consequence of barriers being a
partial order. If you have two parallel transactions:

        A then (barrier) B
        C then (barrier) D

it's ok to schedule those as:

        A, B then (barrier) C, D

This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE
barriers implemented as cache flushes. Therefore scheduling I/Os in
this way is a real optimisation. In that example, it reduces 6 IDE
operations to 5.

This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.

So that raises a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but
some of them don't need to reach the device as actual commands; with
IDE that would allow some cache flush commands to be omitted.
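To make that concrete, here's a rough sketch in C of what I have in
mind (all names invented; this is nothing like the real bio/request
code, just an illustration):

        #include <stdio.h>

        #define MAX_WRITES 8

        /* A barrier as a queue entry carrying explicit write sets:
         * the ids of writes which must complete before it, and of
         * writes which must wait for it. */
        struct barrier {
                int before[MAX_WRITES];
                int nbefore;
                int after[MAX_WRITES];
                int nafter;
        };

        /* Merge b2 into b1.  Safe because a barrier is only a partial
         * order: "A before B" and "C before D" are both still
         * satisfied by "A,C before B,D" -- and one physical flush now
         * serves for two. */
        static void merge_barriers(struct barrier *b1,
                                   const struct barrier *b2)
        {
                int i;

                for (i = 0; i < b2->nbefore; i++)
                        b1->before[b1->nbefore++] = b2->before[i];
                for (i = 0; i < b2->nafter; i++)
                        b1->after[b1->nafter++] = b2->after[i];
        }

        int main(void)
        {
                /* A (id 1) then barrier then B (id 2);
                 * C (id 3) then barrier then D (id 4). */
                struct barrier t1 = { {1}, 1, {2}, 1 };
                struct barrier t2 = { {3}, 1, {4}, 1 };

                merge_barriers(&t1, &t2);  /* 6 device ops become 5 */

                printf("merged: %d before, %d after\n",
                       t1.nbefore, t1.nafter);
                return 0;
        }

The I/O scheduler could do this whenever two barriers end up adjacent
in the queue with no intervening ordering constraint.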
I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to afterwards, that is enough to do this kind of optimisation
in the I/O scheduler.

It would be nice to come up with an interface that the loopback
device can support and relay through to the underlying fs.

> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)

ext3 and reiserfs have both offered this from the beginning, so it's
important to someone. The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a
faster device than the main filesystem -- faster in part because it's
linear writing.

> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.

For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes have completed. The same is true of a database. So waiting on
barriers isn't strictly needed by any application which knows which
writes it has queued before the barrier.

fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in
order.

That's the difference between a cache flush and an ordering command
to some I/O devices. PATA uses cache flush commands for ordering, so
both barrier types are implemented the same. I'm not sure if there
are disks which allow ordering commands without immediately
committing to storage. Are there?
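If they exist, the block layer could pick the cheaper command per
barrier. A sketch of the distinction (again with invented names, not
a real driver interface):

        /* A barrier tagged with what it actually guarantees. */
        enum barrier_kind {
                BARRIER_FLUSH,    /* on stable storage before return */
                BARRIER_ORDERED,  /* later writes must not pass it;
                                   * commit to media may happen later */
        };

        struct device_ops {
                int has_ordering_cmd;       /* e.g. SCSI ordered tags */
                void (*cache_flush)(void);  /* e.g. IDE FLUSH CACHE */
                void (*ordered_cmd)(void);  /* order-only cmd, if any */
        };

        void issue_barrier(const struct device_ops *dev,
                           enum barrier_kind kind)
        {
                if (kind == BARRIER_ORDERED && dev->has_ordering_cmd)
                        dev->ordered_cmd(); /* no forced media write */
                else
                        dev->cache_flush(); /* PATA: flush covers both */
        }

-- 
Jamie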