linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Werner Almesberger <wa@almesberger.net>
Cc: Bryan Henderson <hbryan@us.ibm.com>, linux-fsdevel@vger.kernel.org
Subject: Re: barriers vs. reads - O_DIRECT
Date: Thu, 24 Jun 2004 23:42:59 +0100
Message-ID: <20040624224259.GA12840@mail.shareable.org>
In-Reply-To: <20040624175516.W1325@almesberger.net>

Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*.  The per-device I/O barrier is just a crude
> > approximation.
> 
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)

For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.

For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after.  These represent a
transaction set.

(When data is not journalled, the situation can be more complicated
because to avoid revealing secure data, you might require non-journal
data to be committed before allowing a journal write which increases
the file length or block mapping metadata.  So then you have
non-journal writes before journal writes before other non-journal
writes.  I'm not sure whether ext3 or reiserfs do this.)

You can imagine that every small fs update could become a small
transaction.  That'd be one barrier per transaction.  Or you can
imagine many fs updates being aggregated into a larger transaction.
That'd mean fewer barriers.

Now you see that if many small transactions (the former case) are in
the I/O queue, they _may_ be logically rescheduled by converting them
into larger transactions, which reduces the number of I/O barriers
that reach the device.

That's a simple consequence of barriers being a partial order.  If you
have two parallel transactions:

    A then (barrier) B
    C then (barrier) D

It's ok to schedule those as:

    A, C then (barrier) B, D

This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead.  Think of the IDE
barriers implemented as cache flushes.  Therefore scheduling I/Os in
this way is a real optimisation.  In that example, it reduces 6 IDE
operations (4 writes plus 2 cache flushes) to 5 (4 writes plus 1).
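
To make the counting concrete, here is a minimal sketch of that merge.
The names (Transaction, device_ops, merge) are made up for illustration
and are not any kernel interface.  It assumes the transactions really
are independent, i.e. no write in one transaction's "after" set is a
prerequisite of another transaction's "before" set, and that each
barrier costs one device operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Writes in `before` must reach the device ahead of writes in `after`."""
    before: frozenset
    after: frozenset


def device_ops(transactions):
    """Count device operations when each transaction is issued on its own:
    one per write, plus one barrier (e.g. a cache flush) per transaction."""
    return sum(len(t.before) + len(t.after) + 1 for t in transactions)


def merge(transactions):
    """Combine independent transactions so they share a single barrier.
    Every original 'before' write still precedes every 'after' write."""
    before = frozenset().union(*(t.before for t in transactions))
    after = frozenset().union(*(t.after for t in transactions))
    return Transaction(before, after)


t1 = Transaction(frozenset({"A"}), frozenset({"B"}))
t2 = Transaction(frozenset({"C"}), frozenset({"D"}))
merged = merge([t1, t2])

print(device_ops([t1, t2]))   # 6: four writes, two barriers
print(device_ops([merged]))   # 5: four writes, one barrier
```

The merged transaction is stricter than necessary (A must now also
precede D), but it never violates either original ordering.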

This optimisation is possible even when you have totally independent
filesystems, on different partitions.  Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.

So that raises a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way?  That is, the barriers remain in
the queue to logically constrain the order of other requests, but some
of them don't need to reach the device as actual commands, i.e. with
IDE that would allow some cache flush commands to be omitted.

I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to afterwards, that is enough to do this kind of optimisation in the
I/O scheduler.
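
A rough sketch of such a queue entry, under stated assumptions: Barrier
and may_dispatch are hypothetical names, writes are identified by plain
strings, and "completed" means whatever the device's ordering guarantee
requires.  It only models the ordering constraint, not how the
scheduler would choose which physical barriers to drop.

```python
class Barrier:
    """Hypothetical queue entry: `before` is fixed when the barrier is
    queued, `after` can still grow while the barrier sits in the queue."""

    def __init__(self, before):
        self.before = frozenset(before)  # known prior to entering the queue
        self.after = set()               # may be added to afterwards

    def add_after(self, write):
        self.after.add(write)


def may_dispatch(write, barriers, completed):
    """A write may be issued unless some barrier lists it in its after set
    while part of that barrier's before set is still outstanding."""
    return all(
        write not in b.after or b.before <= completed
        for b in barriers
    )


journal = Barrier({"J1", "J2"})
journal.add_after("D1")

done = {"J1"}
print(may_dispatch("D1", [journal], done))  # False: J2 is still outstanding
print(may_dispatch("X", [journal], done))   # True: X is not constrained
done.add("J2")
print(may_dispatch("D1", [journal], done))  # True
```

Because a write outside both sets is unconstrained, requests from an
unrelated filesystem on another partition can pass the barrier freely.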

It would be nice to come up with an interface that the loopback device
can support and relay through to the underlying fs.

> > 3. What if a journal is on a different device to its filesystem?
> 
> "Don't do this" comes to mind :-)

ext3 and reiserfs both offered this from the beginning, so it's
important to someone.  The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a faster
device than the main filesystem -- faster in part because it's linear
writing.

> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
> 
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.

For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes are completed.

The same is true of a database.  So, waiting on barriers isn't
strictly needed for any application which knows which writes it has
queued before the barrier.
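
As an illustration of waiting on the prerequisite writes rather than on
the barrier, here is a small sketch.  PendingWrites is a hypothetical
helper, not an existing API, and it assumes that a completion
notification means the write has reached stable storage.

```python
import threading


class PendingWrites:
    """Tracks the writes a caller queued before a barrier (hypothetical)."""

    def __init__(self, write_ids):
        self._pending = set(write_ids)
        self._cond = threading.Condition()

    def complete(self, write_id):
        """Called from the I/O completion path for each finished write."""
        with self._cond:
            self._pending.discard(write_id)
            if not self._pending:
                self._cond.notify_all()

    def wait(self, timeout=None):
        """Block until every tracked write has completed.
        Returns False if the timeout expired first."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._pending, timeout)


journal_writes = PendingWrites({"J1", "J2"})
print(journal_writes.wait(timeout=0))  # False: nothing has completed yet
journal_writes.complete("J1")
journal_writes.complete("J2")
print(journal_writes.wait(timeout=0))  # True: fsync() could return now
```

The caller never needs a handle on the barrier itself, only on the
writes it knows it queued ahead of it.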

fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage.  At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in order.

That's the difference between a cache flush and an ordering command to
some I/O devices.  PATA uses cache flush commands for ordering, so both
barrier types are implemented the same way.  I'm not sure if there are
disks which allow ordering commands without immediately committing to
storage.  Are there?

-- Jamie
