* Re: barriers vs. reads
@ 2004-06-24 0:48 Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 0:48 UTC (permalink / raw)
To: linux-fsdevel
BTW, regarding overlapping requests, I wonder if there's a data
structure that gives O(log requests) or such lookups for ranges.
The best I could spontaneously think of would be
O(new_request_size*log(requests*avg_request_size))
which isn't pretty.
BTW2, is O_DIRECT actually a Linux-only thing, or is there some
ancestor whose semantics we may want to preserve ? I've had a
quick look at POSIX, but they don't seem to have direct IO.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 0:48 barriers vs. reads Werner Almesberger
@ 2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 3:39 UTC (permalink / raw)
To: linux-fsdevel
I wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.
Seems that I've found one that is maybe 2-4 times as expensive as a
single tree. It works as follows: if we have a set of ranges (a,b)
and want to see if any of them overlap with a range (x,y), we
compare the indices of the matches (or almost-matches).
num_overlaps = |{a : a < y}| - |{b : b <= x}|
"{a : a < y}" is "the set of all a where a < y". "|...|" is the
number of elements in a set.
We could obtain such indices by counting the number of nodes in
each subtree of the tree. That's O(1) extra for all regular tree
operations, and O(log n) for the sum. The index is the total size
of the left subtrees we've skipped, plus the number of nodes we've
passed through via a right branch. If there are multiple equal
entries, we must find the first one.
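As an illustration, here is a minimal user-space sketch of the same rank
trick (not the program from the URL below; sorted arrays stand in for the
counted tree, so only the lookup is O(log n) here, and all names and
numbers are made up):

#include <stdio.h>

/* number of elements of v[0..n-1] that are strictly less than key */
static int rank_lt(const long *v, int n, long key)
{
	int lo = 0, hi = n;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (v[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* number of elements that are less than or equal to key (integer keys) */
static int rank_le(const long *v, int n, long key)
{
	return rank_lt(v, n, key + 1);
}

int main(void)
{
	/* ranges (a,b): (0,10), (20,30), (40,50); starts and ends kept sorted */
	long starts[] = { 0, 20, 40 };
	long ends[]   = { 10, 30, 50 };
	long x = 25, y = 45;		/* query range (x,y) */

	/* num_overlaps = |{a : a < y}| - |{b : b <= x}| */
	int overlaps = rank_lt(starts, 3, y) - rank_le(ends, 3, x);

	printf("ranges overlapping (%ld,%ld): %d\n", x, y, overlaps);	/* prints 2 */
	return 0;
}

In the real thing the two rank queries would be answered by the
node-count tree described above, so insertion and removal stay
O(log n) as well.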
One problem: I did this mostly by instinct. It seems to work
perfectly, but I can't quite explain why :-(
I put a dirty little program to simulate this on
http://abiss.sourceforge.net/t.tar.gz
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
1 sibling, 1 reply; 19+ messages in thread
From: Herbert Poetzl @ 2004-06-24 8:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
On Thu, Jun 24, 2004 at 12:39:44AM -0300, Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).
>
> num_overlaps = |{a : a < y}| - |{b : b <= x}|
>
> "{a : a < y}" is "the set of all a where a < y". "|...|" is the
> number of elements in a set.
>
> We could obtain such indices by counting the number of nodes in
> each subtree of the tree. That's O(1) extra for all regular tree
> operations, and O(log n) for the sum. The index is the total size
> of the left subtrees we've skipped, plus the number of nodes we've
> passed through via a right branch. If there are multiple equal
> entries, we must find the first one.
>
> One problem: I did this mostly by instinct. It seems to work
> perfectly, but I can't quite explain why :-(
hmm ... there are eight cases of how ranges can interact ...
a b a b
| | | |
1) +-----+ +---------+ 2) +-----+ +---------+
| | | |
x y x y
a b a b
| | | |
3) +-------------+---------+ 4) +-------------+---------+
| | | |
x y x y
a b a b
| | | |
5) +-------------+===+-----+ 6) +-----+========+--------+
| | | |
x y x y
a b a b
| | | |
7) +-----+========+--------+ 8) +-----+========+--------+
| | | |
x y x y
by verifying those eight cases for correctness, you
can conclude that the sum of N such cases will give
the correct number of overlaps (with a given test
range); verification itself is simple:
 case |  a<y  | b<=x  | |{a : a<y}| - |{b : b<=x}|
------+-------+-------+----------------------------
  1)  |  YES  |  YES  |             0
  2)  |  NO   |  NO   |             0
  3)  |  YES  |  YES  |             0
  4)  |  NO   |  NO   |             0
------+-------+-------+----------------------------
  5)  |  YES  |  NO   |             1
  6)  |  YES  |  NO   |             1
  7)  |  YES  |  NO   |             1
  8)  |  YES  |  NO   |             1
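The identity can also be checked by brute force; a throwaway user-space
test along these lines (my own sketch: random integer ranges, direct
count vs. the formula; touching endpoints do not count as overlap,
matching cases 1-4 above) never finds a mismatch:

#include <stdio.h>
#include <stdlib.h>

#define NR 8

int main(void)
{
	int trial, i;

	srand(1);
	for (trial = 0; trial < 100000; trial++) {
		long a[NR], b[NR], x, y;
		int direct = 0, by_rank = 0;

		for (i = 0; i < NR; i++) {
			a[i] = rand() % 64;
			b[i] = a[i] + 1 + rand() % 16;
		}
		x = rand() % 64;
		y = x + 1 + rand() % 16;

		for (i = 0; i < NR; i++) {
			if (a[i] < y && b[i] > x)	/* (a,b) overlaps (x,y)? */
				direct++;
			if (a[i] < y)			/* |{a : a < y}| */
				by_rank++;
			if (b[i] <= x)			/* - |{b : b <= x}| */
				by_rank--;
		}
		if (direct != by_rank) {
			printf("mismatch in trial %d\n", trial);
			return 1;
		}
	}
	printf("formula matched the direct count in all %d trials\n", trial);
	return 0;
}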
HTH,
Herbert
> I put a dirty little program to simulate this on
> http://abiss.sourceforge.net/t.tar.gz
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 12:16 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 12:16 UTC (permalink / raw)
To: linux-fsdevel
Herbert Poetzl wrote:
> by verifying those eight cases for correctness, you
> can conclude, that the sum of N such cases will give
> the correct number of overlaps (with a given test
> range); verification itself is simple:
Thanks (very nice drawings, btw.) ! Yes, now that I see it
written down, I also see how we can generalize that to a
proof for any number of (a,b) ranges. Cool !
So this means that we have a solution to detect overlaps
that shouldn't be significantly slower than, say, tree+hash
as used in the anticipatory scheduler. One problem is that
this approach doesn't tell us where the overlapping
requests are, only that they are somewhere out there.
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 13:36 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).
Is the prio_tree data structure, the one being used by recent VM work,
which keeps track of ranges, any use?
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 0 replies; 19+ messages in thread
From: Steve Lord @ 2004-06-24 16:39 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
Werner Almesberger wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.
> The best I could spontaneously think of would be
> O(new_request_size*log(requests*avg_request_size))
> which isn't pretty.
>
> BTW2, is O_DIRECT actually a Linux-only thing, or is there some
> ancestor whose semantics we may want to preserve ? I've had a
> quick look at POSIX, but they don't seem to have direct IO.
Irix has O_DIRECT, and Solaris has something too, but it is not
in the POSIX specs. Cray Unicos is the oldest implementation I came
across.
Irix explicitly lets multiple readers and writers into a file
at once with O_DIRECT. The assumption being that the application
which does this is doing its own coordination and will not
shoot itself in the foot.
Steve Lord
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
@ 2004-06-24 17:00 ` Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2004-06-24 17:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>BTW2, is O_DIRECT actually a Linux-only thing, or is there some
>ancestor whose semantics we may want to preserve ? I've had a
>quick look at POSIX, but they don't seem to have direct IO.
Several operating systems got O_DIRECT before, but practically at the same
time as, Linux. There is no standard and no agreement on the small
details, such as what to do when someone tries to read the last
half-sector of a file, given that direct I/O is supposed to be in whole
sectors.
It seems obvious to me that whatever ordering guarantees the user gets
without the O_DIRECT flag, he should get with it as well. The user
doesn't see disk queues and sectors and elevators and file caches. He
sees byte-stream files.
If we can give much better performance by withdrawing those guarantees for
O_DIRECT, though, users would probably accept it since there is in fact no
strong legacy we have to live up to.
When I worry about I/O scheduling, I usually worry a lot more about block
device direct I/O (raw devices) than file direct I/O. The former is where
the block layer is most exposed to the user.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 13:36 ` Jamie Lokier
@ 2004-06-24 17:02 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:02 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel
Jamie Lokier wrote:
> Is the prio_tree data structure, the one being used by recent VM work,
> which keeps track of ranges, any use?
Hmm, they deliver exactly the data we need. That's great. Unlike
rb-trees, though, they aren't balanced, and thus have a worst-case
O(log n) term with "log n" = sizeof(sector_t)*8 for simple
lookups. Furthermore, they have a scary worst-case insertion
time of O((log n)^2).
With rb-trees, we get a mere O(log nr_requests) for lookups. So
that's A*32 vs. B*7 using all-default 2.6.7 on ia32. Insertion
has the same typical cost, and A*1024 vs. B*7 worst-case. A and B
are implementation-specific per-operation cost factors.
But yes, they look very interesting ...
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
@ 2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-25 0:11 ` Bryan Henderson
0 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:46 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel
Bryan Henderson wrote:
> It seems obvious to me that whatever ordering guarantees the user gets
> without the O_DIRECT flag, he should get with it as well.
Yes, it would be nice if we could obtain such behaviour without
unacceptable performance sacrifices. It seems to me that, if we
can find an efficient way for serializing all write-write and
read-write overlaps, plus have explicit barriers for serializing
non-overlapping writes, this should yield pretty much what
everyone wants (*). Now, that "if" needs a bit of work ... :-)
(*) The only difference being that a completing read doesn't
tell you whether the elevator has already passed a barrier.
Currently, one could be lured into depending on this.
> When I worry about I/O scheduling, I usually worry a lot more about block
> device direct I/O (raw devices) than file direct I/O. The former is where
> the block layer is most exposed to the user.
I haven't looked at the direct IO code in detail, but it seems
to me that the elevator behaviour affects file-based direct IO
in the same way as the device-based one.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
@ 2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
2004-06-25 0:11 ` Bryan Henderson
1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 18:50 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel
Werner Almesberger wrote:
> Bryan Henderson wrote:
> > It seems obvious to me that whatever ordering guarantees the user gets
> > without the O_DIRECT flag, he should get with it as well.
>
> Yes, it would be nice if we could obtain such behaviour without
> unacceptable performance sacrifices. It seems to me that, if we
> can find an efficient way for serializing all write-write and
> read-write overlaps, plus have explicit barriers for serializing
> non-overlapping writes, this should yield pretty much what
> everyone wants (*). Now, that "if" needs a bit of work ... :-)
Note that what filesystems and databases want is write-write *partial
dependencies*. The per-device I/O barrier is just a crude
approximation.
1. Think about this: two filesystems on different partitions of the same
device. The writes of each filesystem are independent, yet the
barriers will force the writes of one filesystem to come before
later-queued writes of the other.
2. Or, two database back-ends doing direct I/O to two separate files.
It's probably not a big performance penalty, but it illustrates that
the barriers are "bigger" than they need to be. Worth taking into
account when deciding what minimal ordering everyone _really_ wants.
If you do implement overlap detection logic, then would giving
barriers an I/O range be helpful? E.g. corresponding to partitions.
Here's a few more cases, which may not be quite right even now:
3. What if a journal is on a different device to its filesystem?
Ideally, write barriers between the different device queues would be
appropriate.
4. A journalling filesystem mounted on a loopback device. Is this
reliable now?
5. A journalling filesystem mounted on two loopback devices -- one for
the fs, one for the journal.
> (*) The only difference being that a completing read doesn't
> tell you whether the elevator has already passed a barrier.
> Currently, one could be lured into depending on this.
Isn't the barrier itself an I/O operation which can be waited on?
I agree something could depend on the reads at the moment.
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
0 siblings, 1 reply; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 20:55 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel
Jamie Lokier wrote:
> Note that what filesystems and databases want is write-write *partial
> dependencies*. The per-device I/O barrier is just a crude
> approximation.
True ;-) So what would an ideally flexible model look like ?
Partial order ? Triggers plus virtual requests ? There's also
the little issue that this should still yield an interface
that people can understand without taking a semester of
graph theory ;-)
> 3. What if a journal is on a different device to its filesystem?
"Don't do this" comes to mind :-)
> Isn't the barrier itself an I/O operation which can be waited on?
> I agree something could depend on the reads at the moment.
Making barriers waitable might be very useful, yes. That could
also be a step towards implementing those cross-device barriers.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 20:55 ` Werner Almesberger
@ 2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
0 siblings, 2 replies; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 22:42 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel
Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)
For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.
For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.
(When data is not journalled, the situation can be more complicated
because to avoid revealing secure data, you might require non-journal
data to be committed before allowing a journal write which increases
the file length or block mapping metadata. So then you have
non-journal writes before journal writes before other non-journal
writes. I'm not sure if ext3 or reiserfs do this).
You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can
imagine many fs updates are aggregated into a larger transaction.
That'd aggregate into fewer barriers.
Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- and reducing the number of I/O barriers which
reach the device.
That's a simple consequence of barriers being a partial order. If you
have two parallel transactions:
A then (barrier) B
C then (barrier) D
It's ok to schedule those as:
A, B then (barrier) C, D
This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE
barriers implemented as cache flushes. Therefore scheduling I/Os in
this way is a real optimisation. In that example, it reduces 6 IDE
transactions to 5.
This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.
So that begs a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but some
of them don't need to reach the device as actual commands, i.e. with
IDE that would allow some cache flush commands to be omitted.
I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to afterwards, that is enough to do this kind of optimisation in the
I/O scheduler.
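Purely as a sketch of what such a queue entry might look like (nothing
like this exists in the block layer; the struct and field names are
invented here for illustration):

struct request;				/* the block-layer request, opaque here */

struct ordered_barrier {
	struct request	**before;	/* writes that must complete before the
					 * barrier; fixed when it is queued */
	unsigned int	nr_before;
	struct request	**after;	/* writes held back until the barrier
					 * has passed; may still grow later */
	unsigned int	nr_after;
	int		needs_flush;	/* 1: must reach the device as a real
					 * command (e.g. an IDE cache flush);
					 * 0: ordering constraint only, so it
					 * may be merged with a neighbour */
};

An I/O scheduler merging two such barriers would keep the union of their
before/after sets and could drop one of the physical flushes.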
It would be nice to come up with an interface that the loopback device
can support and relay through the underlying fs.
> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)
ext3 and reiserfs both offered this from the beginning, so it's
important to someone. The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a faster
device than the main filesystem -- faster in part because it's linear
writing.
> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.
For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes are completed.
The same is true of a database. So, waiting on barriers isn't
strictly needed for any application which knows which writes it has
queued before the barrier.
fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in order.
That's the difference between a cache flush and an ordering command to
some I/O devices. PATA uses cache flush commands for ordering so both
barrier types are implemented the same. I'm not sure if there are
disks which allow ordering commands without immediately committing to
storage. Are there?
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 0:11 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> It seems obvious to me that whatever ordering guarantees the user gets
>> without the O_DIRECT flag, he should get with it as well.
>
>Yes, it would be nice if we could obtain such behaviour without
>unacceptable performance sacrifices. It seems to me that, if we
>can find an efficient way for serializing all write-write and
>read-write overlaps, plus have explicit barriers for serializing
>non-overlapping writes, this should yield pretty much what
>everyone wants
Maybe it's time to identify a specific scenario we're trying to make
right. Because if the goal is for reads and writes with O_DIRECT to
interact the same way as without, then I don't see where the block layer
even comes into it -- the filesystem layer does the job.
There are two cases: atomic writes and nonatomic writes. Some filesystems
do atomic writes, which is what everyone seems to expect but isn't
actually stated in POSIX, and others don't. I heard Linux filesystems
based on generic_file_read/write don't, but I haven't looked closely at
those.
_With_ atomic writes, the filesystem code must use a lock to ensure a
whole write() finishes before another write() or read() begins. (If it's
smart, it can do this only where there are overlaps of file offsets, but I
don't think that's common). In the case of cached I/O, "finished" means
all the data is in the cache; with direct I/O, it means it's all on the
disk.
_Without_ atomic writes, the caller of read() and write() is responsible
for synchronization. There simply isn't any guarantee that if Process A
does a read while Process B is doing a write that A won't see just part of
B's write. So if you're brave enough to have multiple processes
simultaneously reading and writing the same file regions, the processes
have to talk to each other.
There's also (kernel) aio to consider. I can see barriers playing a role
there. But I can't see how barriers figure into ordinary O_DIRECT file
I/O.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 0:11 ` Bryan Henderson
@ 2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
0 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 2:42 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel
Bryan Henderson wrote:
> There's also (kernel) aio to consider.
I suppose this would be the main scenario. AIO+O_DIRECT also
gives the closest approximation to directly talking to the
elevator.
> But I can't see how barriers figure into ordinary O_DIRECT file I/O.
Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
For reasonable semantics, I guess it would either have to block
until the operation has completed, or be "atomic" (*).
(*) Defining atomic can be tricky, too. E.g. see
http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.0/1361.html
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 22:42 ` Jamie Lokier
@ 2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
1 sibling, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 3:21 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel
Jamie Lokier wrote:
> For that purpose, a barrier has a set of writes which must come before
> it, and a set of writes which must come after. These represent a
> transaction set.
Okay, partial order then. AFAIK, this doesn't allow us to
express things like "do A after one of { B, C } is done".
That could be useful for redundant storage structures, but
we may not care.
> A then (barrier) B
> C then (barrier) D
>
> It's ok to schedule those as:
>
> A, B then (barrier) C, D
I think you mean "A, C then (barrier) B, D" :-)
There's a problem, though: the first has the following relations:
A < B, C < D. The second has: A < B, A < D, C < B, C < D. So now
the elevator needs to decide if avoiding the cost of the
side-effects of a barrier (cache-flush, or such) is acceptable,
given the cost of the additional ordering restrictions. (I'm
carefully trying to avoid to say "has less cost", because the
elevator may not always pick the "cheapest" variant.)
> It would be nice to come up with a interface that the loopback device
> can support and relay through the underlying fs.
You really like those loopback devices, don't you ? :-)
> ext3 and reiserfs both offered this from the begining, so it's
> important to someone.
Sigh, yes ...
> committed prerequisite writes to stable storage. At other times, a
> barrier is there only to preserve ordering so that a journal
> functions, but it's not required that the data is actually committed
> to storage immediately -- merely that it _will_ be committed in order.
Well, to implement cross-device barriers, you need a means to
find out if the prerequisites for the local device to cross a
barrier are satisfied, and to make the local device wait until
then.
This may be implemented in a completely different thread/whatever
than the actual IO.
Making an elevator non-work-conserving is an interesting
exercise by itself. (It may be feasible: I've done it for a power
management experiment (*), and this works at least for PATA.)
(*) To stop, I simply make next_req return NULL. queue_empty
still returns the correct result. To resume, I have an
external trigger that calls blk_start_queue.
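Roughly like this, as an untested outline against the 2.6-era
elevator hooks (the pause flag, the wrapped scheduler's next_req
function and the resume entry point are made-up names; locking and
elevator registration are omitted):

static int io_paused;			/* set by the power-management policy */

static struct request *pausing_next_req(request_queue_t *q)
{
	if (io_paused)
		return NULL;		/* look idle; requests simply stay queued */
	return real_next_req(q);	/* made-up name: the wrapped scheduler's
					 * elevator_next_req_fn */
}

static void io_resume(request_queue_t *q)	/* the external trigger */
{
	io_paused = 0;
	blk_start_queue(q);		/* kick the queue so next_req is consulted again */
}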
There's of course the question whether the elevator is really the
right place for all this. Perhaps just embedding a callback in a
barrier request could be used for a more general solution - in
particular one that would allow us to avoid having to solve all
problems at once :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: barriers vs. reads - O_DIRECT
2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
@ 2004-06-25 3:57 ` Guy
2004-06-25 4:52 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Guy @ 2004-06-25 3:57 UTC (permalink / raw)
To: 'Jamie Lokier', 'Werner Almesberger'
Cc: 'Bryan Henderson', linux-fsdevel
What if a filesystem is on a software RAID5 array? How does the filesystem
use a barrier when some of the writes are on different disks? Would md pass
the barrier down to each disk?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 3:57 ` Guy
@ 2004-06-25 4:52 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 4:52 UTC (permalink / raw)
To: Guy; +Cc: 'Jamie Lokier', 'Bryan Henderson', linux-fsdevel
Guy wrote:
> What if a filesystem is on a software RAID5 array? How does the filesystem
> use a barrier when some of the writes are on different disks? Would md pass
> the barrier down to each disk?
Let's assume we just have a FIFO elevator per device, barriers
affect all requests, and we don't allow any reordering across
barriers. Then, a device would have to stop IO when it hits a
barrier, and could only resume when all other devices using
this barrier have reached it.
md would have to send the barrier to each device, with the
exception that in any sequence of barriers without other
requests, only the first and the last barrier are needed.
Now, if we have a smarter elevator and smarter barriers,
there are more tricks that can be played. E.g. an elevator
could try to reach a barrier shared by many others as soon
as possible, and then schedule requests allowed to cross
the barrier while waiting for the other devices.
I'm not sure how sophisticated we want to get in the
multiple devices case, though.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT aio
2004-06-25 2:42 ` Werner Almesberger
@ 2004-06-25 15:59 ` Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
1 sibling, 0 replies; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 15:59 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> There's also (kernel) aio to consider.
>
>I suppose this would be the main scenario. AIO+O_DIRECT also
>gives the closest approximation to directly talking to the
>elevator.
I agree. And it would be good for the aio interface to have all the
functionality of a block device's make_request_fn, including whatever
barrier support it winds up with.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
@ 2004-06-25 16:31 ` Bryan Henderson
1 sibling, 0 replies; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 16:31 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> But I can't see how barriers figure into ordinary O_DIRECT file I/O.
>
>Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
>For reasonable semantics, I guess it would either have to block
>until the operation has completed, or be "atomic"
I think this part is well established. When an O_DIRECT write() returns,
the data has to have been written to the disk, ergo the process has to
block until all relevant I/O has completed. Remember that a primary
purpose of direct I/O is to allow multiple systems to access the same disk
from different ports. It has to be possible for a system to send a
message to another system saying, "I just wrote my data on disk Q. Have a
look."
_In addition_ to that requirement, the filesystem may or may not make that
write() atomic as viewed by other processes. (And as you point out, there
are multiple kinds of atomicity it might choose). I don't believe it
would make use of a block layer barrier in either case.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
2004-06-25 4:52 ` Werner Almesberger
2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson