* barriers vs. reads
@ 2004-06-22 3:53 Werner Almesberger
2004-06-22 7:39 ` Jens Axboe
0 siblings, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 3:53 UTC (permalink / raw)
To: linux-fsdevel
I'm working on an elevator with priorities, and I'm wondering what
semantics are expected from barriers when it comes to reads.
My problem with read barriers is that they can upset priorities
quite a bit, by forcing the entire queue to be processed before
any new (possibly timing-critical) reads are allowed.
So, is there anything that actually depends on barriers also
constraining read - or, more likely, read vs. write - order ?
If not, will there be ?
Also, it seems, but is never quite explicitly spelt out, that an
elevator is never really supposed to look for barriers in
rq->flags, but can solely rely on the insertion position as an
indication for barriers. Is this true ?
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 3:53 barriers vs. reads Werner Almesberger
@ 2004-06-22 7:39 ` Jens Axboe
2004-06-22 7:50 ` Werner Almesberger
0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 7:39 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
On Tue, Jun 22 2004, Werner Almesberger wrote:
> I'm working on an elevator with priorities, and I'm wondering what
> semantics are expected from barriers when it comes to reads.
>
> My problem with read barriers is that they can upset priorities
> quite a bit, by forcing the entire queue to be processed before
> any new (possibly timing-critical) reads are allowed.
>
> So, is there anything that actually depends on barriers also
> constraining read - or, more likely, read vs. write - order ?
> If not, will there be ?
I don't think a read-barrier currently has a meaning. A write barrier
will force ordering for later reads too, of course.
> Also, it seems, but is never quite explicitly spelt out, that an
> elevator is never really supposed to look for barriers in
> rq->flags, but can solely rely on the insertion position as an
> indication for barriers. Is this true ?
It can't, the insert position doesn't tell you whether it's a barrier or
not. You have to check ->flags for that.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 7:39 ` Jens Axboe
@ 2004-06-22 7:50 ` Werner Almesberger
2004-06-22 7:55 ` Jens Axboe
0 siblings, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 7:50 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel
Jens Axboe wrote:
> I don't think a read-barrier currently has a meaning. A write barrier
> will force ordering for later reads too, of course.
That's one of the problem spots with priorities: if there are a
lot of writes in the queue, high-priority reads will be delayed
for a long time.
But do we have cases where reads must not cross write barriers ?
> It can't, the insert position doesn't tell you whether it's a barrier or
> not. You have to check ->flags for that.
Yet deadline, AS, and CFQ don't do any such check :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 7:50 ` Werner Almesberger
@ 2004-06-22 7:55 ` Jens Axboe
2004-06-22 8:34 ` Werner Almesberger
2004-06-22 11:28 ` Jamie Lokier
0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 7:55 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
On Tue, Jun 22 2004, Werner Almesberger wrote:
> Jens Axboe wrote:
> > I don't think a read-barrier currently has a meaning. A write barrier
> > will force ordering for later reads too, of course.
>
> That's one of the problem spots with priorities: if there are a
> lot of writes in the queue, high-priority reads will be delayed
> for a long time.
If there are lots of barrier writes, you mean?
> But do we have cases where reads must not cross write barriers ?
To me, it's the expected behaviour. If you issue a barrier write, a read
issued later should not be able to fetch old data.
> > It can't, the insert position doesn't tell you whether it's a barrier or
> > not. You have to check ->flags for that.
>
> Yet deadline, AS, and CFQ don't do any such check :-)
Hmm? Recently this was moved into __elv_add_request() to make sure that
a barrier always implies ELEVATOR_INSERT_BACK so these checks were
removed. deadline still has it though:
/* barriers must flush the reorder queue */
if (unlikely(rq->flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)
&& where == ELEVATOR_INSERT_SORT))
where = ELEVATOR_INSERT_BACK;
this can be removed now, though. So it's definitely there, if you are
using a recent kernel you can assume that INSERT_BACK implies a barrier.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 7:55 ` Jens Axboe
@ 2004-06-22 8:34 ` Werner Almesberger
2004-06-22 10:08 ` Jens Axboe
2004-06-22 11:28 ` Jamie Lokier
1 sibling, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 8:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel
Jens Axboe wrote:
> If there are lots of barrier writes, you mean?
If there are a lot of writes before the barrier. Then the reads
after the barrier have to wait for all these writes to complete.
Of course, things get even worse if you have a lot of barriers,
in addition to there being lots of writes.
> To me, it's the expected behaviour. If you issue a barrier write, a read
> issued later should not be able to fetch old data.
... which pretty much kills the idea of short predictable
queuing delays :-(
> Hmm? Recently this was moved into __elv_add_request()
Ah, okay, different definition of where the elevator starts ;-)
Yes, I saw that.
BTW, in what cases would ELEVATOR_INSERT_FRONT combined with
a barrier make sense ?
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 8:34 ` Werner Almesberger
@ 2004-06-22 10:08 ` Jens Axboe
0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 10:08 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
On Tue, Jun 22 2004, Werner Almesberger wrote:
> Jens Axboe wrote:
> > If there are lots of barrier writes, you mean?
>
> If there are a lot of writes before the barrier. Then the reads
> after the barrier have to wait for all these writes to complete.
> Of course, things get even worse if you have a lot of barriers,
> in addition to there being lots of writes.
There's nothing you can do about that, in my opinion. Barriers are bad
for io scheduler performance, that's a given.
> > To me, it's the expected behaviour. If you issue a barrier write, a read
> > issued later should not be able to fetch old data.
>
> ... which pretty much kills the idea of short predictable
> queuing delays :-(
So you can't support guarenteed low delays with lots of writes and
barriers, big deal. If you need hard guarantees, you need to tailor the
environment. IMHO this is no different than the regular linux code base
not supporting hard realtime processing.
> > Hmm? Recently this was moved into __elv_add_request()
>
> Ah, okay, different definition of where the elevator starts ;-)
> Yes, I saw that.
>
> BTW, in what cases would ELEVATOR_INSERT_FRONT combined with
> a barrier make sense ?
It wouldn't really, INSERT_BACK is the only one that really makes sense.
But if you do an elv_requeue_request() (ide barrier does this) to
reinsert a barrier, it would have the barrier bit set but need to go to
the front anyways. SCSI does it too, come to think of it.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 7:55 ` Jens Axboe
2004-06-22 8:34 ` Werner Almesberger
@ 2004-06-22 11:28 ` Jamie Lokier
2004-06-22 11:32 ` Jens Axboe
2004-06-22 18:45 ` Werner Almesberger
1 sibling, 2 replies; 46+ messages in thread
From: Jamie Lokier @ 2004-06-22 11:28 UTC (permalink / raw)
To: Jens Axboe; +Cc: Werner Almesberger, linux-fsdevel
Jens Axboe wrote:
> > But do we have cases where reads must not cross write barriers ?
>
> To me, it's the expected behaviour. If you issue a barrier write, a read
> issued later should not be able to fetch old data.
Two things:
1. A read _which doesn't overlap writes before the barrier_
should be ok before the barrier with no visible change.
So, look at the block numbers and permit reordering if there's
no overlap. This reordering is semantically invisible.
2. Other than O_DIRECT, can the I/O subsystem issue reads that
overlap writes in flight? Surely that never occurs?
If it never occurs, then reads can be safely moved before write
barriers without looking at block numbers.
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 11:28 ` Jamie Lokier
@ 2004-06-22 11:32 ` Jens Axboe
2004-06-22 17:12 ` Bryan Henderson
2004-06-22 18:53 ` Werner Almesberger
2004-06-22 18:45 ` Werner Almesberger
1 sibling, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 11:32 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Werner Almesberger, linux-fsdevel
On Tue, Jun 22 2004, Jamie Lokier wrote:
> Jens Axboe wrote:
> > > But do we have cases where reads must not cross write barriers ?
> >
> > To me, it's the expected behaviour. If you issue a barrier write, a read
> > issued later should not be able to fetch old data.
>
> Two things:
>
> 1. A read _which doesn't overlap writes before the barrier_
> should be ok before the barrier with no visible change.
>
> So, look at the block numbers and permit reordering if there's
> no overlap. This reordering is semantically invisible.
You mean a read that doesn't contain sectors that overlap with the
barrier writes? Yes that would be fine.
It's easier said than done, though. Current io schedulers don't handle
barriers in a very fast fashion - they push all pending requests from
the internal sorted tree to the dispatch list, the latter which is
always accessed in FIFO like fashion (io scheduler adds to tail, driver
eats from the head). So if you wanted to optimize this, that has to be
changed.
> 2. Other than O_DIRECT, can the I/O subsystem issue reads that
> overlap writes in flight? Surely that never occurs?
No, it can only happen for reads that don't go through the page cache.
> If it never occurs, then reads can be safely moved before write
> barriers without looking at block numbers.
It can happen with direct io of any sort, the solution has to take this
into account. That's why we currently have handling for rbtree aliases
as well.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 11:32 ` Jens Axboe
@ 2004-06-22 17:12 ` Bryan Henderson
2004-06-22 20:53 ` Jens Axboe
2004-06-22 18:53 ` Werner Almesberger
1 sibling, 1 reply; 46+ messages in thread
From: Bryan Henderson @ 2004-06-22 17:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jamie Lokier, linux-fsdevel, Werner Almesberger
>> 2. Other than O_DIRECT, can the I/O subsystem issue reads that
>> overlap writes in flight? Surely that never occurs?
>
>No, it can only happen for reads that don't go through the page cache.
>
>> If it never occurs, then reads can be safely moved before write
>> barriers without looking at block numbers.
>
>It can happen with direct io of any sort, the solution has to take this
>into account. That's why we currently have handling for rbtree aliases
>as well.
Are you saying that if Sector A contains "foo", and I do a
__make_request(Write, Sector A, "bar") and then a __make_request(Read,
Sector A), the read might read "foo"? Assuming no barriers.
As I understand it, the topic of discussion is completely outside the
realm of the page cache and open() flags -- it's all down below that.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 11:28 ` Jamie Lokier
2004-06-22 11:32 ` Jens Axboe
@ 2004-06-22 18:45 ` Werner Almesberger
2004-06-22 19:07 ` Guy
1 sibling, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 18:45 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Jens Axboe, linux-fsdevel
Jamie Lokier wrote:
> 1. A read _which doesn't overlap writes before the barrier_
> should be ok before the barrier with no visible change.
Ah, that's an excellent point. Thanks !
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 11:32 ` Jens Axboe
2004-06-22 17:12 ` Bryan Henderson
@ 2004-06-22 18:53 ` Werner Almesberger
2004-06-22 19:57 ` Jamie Lokier
2004-06-22 20:57 ` Jens Axboe
1 sibling, 2 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 18:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jamie Lokier, linux-fsdevel
Jens Axboe wrote:
> It can happen with direct io of any sort, the solution has to take this
> into account. That's why we currently have handling for rbtree aliases
> as well.
How well is this actually supposed to work ? When reading what
as-iosched does, I was left with the impression that you could
construct a set of partially overlapping requests that doesn't
get sorted in FIFO order.
I haven't tried to feed as-iosched such a request mix, though,
so maybe I'm wrong.
(For partially overlapping requests, it may actually be nice
to be able to break them into multiple parts, and queue them
separately. Particularly if they also come with distinct
priorities.)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* RE: barriers vs. reads
2004-06-22 18:45 ` Werner Almesberger
@ 2004-06-22 19:07 ` Guy
0 siblings, 0 replies; 46+ messages in thread
From: Guy @ 2004-06-22 19:07 UTC (permalink / raw)
To: 'Werner Almesberger', 'Jamie Lokier'
Cc: 'Jens Axboe', linux-fsdevel
Maybe I am missing something.
A read that does not overlap a write could be moved up, regardless of
barriers.
A read that does overlap a write could be moved up to the write, but not
before it, regardless of barriers.
Writes that overlap should be written in order! Or corrupt data!
Guy
-----Original Message-----
From: linux-fsdevel-owner@vger.kernel.org
[mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Werner Almesberger
Sent: Tuesday, June 22, 2004 2:46 PM
To: Jamie Lokier
Cc: Jens Axboe; linux-fsdevel@vger.kernel.org
Subject: Re: barriers vs. reads
Jamie Lokier wrote:
> 1. A read _which doesn't overlap writes before the barrier_
> should be ok before the barrier with no visible change.
Ah, that's an excellent point. Thanks !
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 18:53 ` Werner Almesberger
@ 2004-06-22 19:57 ` Jamie Lokier
2004-06-22 23:13 ` Werner Almesberger
2004-06-22 20:57 ` Jens Axboe
1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2004-06-22 19:57 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Jens Axboe, linux-fsdevel
Werner Almesberger wrote:
> (For partially overlapping requests, it may actually be nice
> to be able to break them into multiple parts, and queue them
> separately. Particularly if they also come with distinct
> priorities.)
If there's a read which depends on a prior write, doesn't it make more
sense to just copy the data in memory for the parts which overlap?
If you do that, then you can service all high priority reads properly:
those that don't overlap can cross write barriers, and those that do
overlap can be copied in memory immediately.
There may be a case for forcing a direct read for programs which check
storage data integrity, although probably not even then is it useful.
(A program that wants to write some data and then read using direct
I/O will call write(), _wait_ until that returns after committing the
data, and then call read()).
Even if the current behaviour of requiring a device read after the
barrier must remain for ordinary direct I/O, it isn't required for
requests marked as "high-priority I/O".
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 17:12 ` Bryan Henderson
@ 2004-06-22 20:53 ` Jens Axboe
2004-06-23 16:41 ` Bryan Henderson
0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 20:53 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jamie Lokier, linux-fsdevel, Werner Almesberger
On Tue, Jun 22 2004, Bryan Henderson wrote:
> >> 2. Other than O_DIRECT, can the I/O subsystem issue reads that
> >> overlap writes in flight? Surely that never occurs?
> >
> >No, it can only happen for reads that don't go through the page cache.
> >
> >> If it never occurs, then reads can be safely moved before write
> >> barriers without looking at block numbers.
> >
> >It can happen with direct io of any sort, the solution has to take this
> >into account. That's why we currently have handling for rbtree aliases
> >as well.
>
> Are you saying that if Sector A contains "foo", and I do a
> __make_request(Write, Sector A, "bar") and then a __make_request(Read,
> Sector A), the read might read "foo"? Assuming no barriers.
Well no, even for direct io we maintain ordering if you hit an alias.
See cfq_add_crq_rb() in cfq-iosched.c for example. AS handles it
differently, but result is the same. At least the io is issued in that
order, so if the underlying storage hardware doesn't mess it up, there
is no problem.
If you have overlapping requests, then you could get into serious
trouble. You should not do that.
> As I understand it, the topic of discussion is completely outside the
> realm of the page cache and open() flags -- it's all down below that.
The page cache will make sure don't see "foo", since you'll wait for the
page to unlocked at the end of io. The above description is only valid
for direct io, this is where the alias handling works.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 18:53 ` Werner Almesberger
2004-06-22 19:57 ` Jamie Lokier
@ 2004-06-22 20:57 ` Jens Axboe
2004-06-22 23:10 ` Werner Almesberger
1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2004-06-22 20:57 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Jamie Lokier, linux-fsdevel
On Tue, Jun 22 2004, Werner Almesberger wrote:
> Jens Axboe wrote:
> > It can happen with direct io of any sort, the solution has to take this
> > into account. That's why we currently have handling for rbtree aliases
> > as well.
>
> How well is this actually supposed to work ? When reading what
> as-iosched does, I was left with the impression that you could
> construct a set of partially overlapping requests that doesn't
> get sorted in FIFO order.
Overlapping requests are only detected if they start at the same
sector.
The mechanism is just there because of the data structure use, Linux has
never made any effort to guard against this outside of the page cache
context. If you issue direct io that overlaps, you are providing your
own rope.
> I haven't tried to feed as-iosched such a request mix, though,
> so maybe I'm wrong.
>
> (For partially overlapping requests, it may actually be nice
> to be able to break them into multiple parts, and queue them
> separately. Particularly if they also come with distinct
> priorities.)
Bad idea, unless you have zero setup overhead for the hardware issued
commands. Linux will also attempt to remerge these requests when it
later discovers they are adjacent. You can block this by disallowing
merging of request with different priorities, but I really don't see why
you'd want to do that. It would be a net loss in the end anyways.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 20:57 ` Jens Axboe
@ 2004-06-22 23:10 ` Werner Almesberger
2004-06-23 0:14 ` Jamie Lokier
2004-06-23 6:27 ` Jens Axboe
0 siblings, 2 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 23:10 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jamie Lokier, linux-fsdevel
Jens Axboe wrote:
> Overlapping requests are only detected if they start at the same
> sector.
>
> The mechanism is just there because of the data structure use, [...]
So even the special handling of requests that start with the
same sector isn't required, and shouldn't be depended on, right ?
> Bad idea, unless you have zero setup overhead for the hardware issued
> commands. Linux will also attempt to remerge these requests when it
> later discovers they are adjacent. You can block this by disallowing
> merging of request with different priorities, but I really don't see why
> you'd want to do that. It would be a net loss in the end anyways.
The issue is that you may get large requests, in the middle of
which a single page gets a higher priority, e.g. because the
large request comes from a low-priority copy operation, and
there's a high-priority reader concurrently working on the
same file.
In this case, the high-priority reader either has to wait for
the whole low-priority request to crawl to the head of the queue
(probably missing the deadline of the high-priority read), or we
could take the request and raise its priority, giving our
low-priority reader a nice boost. The latter isn't so bad if it
happens every once in a while, but someone may figure out how to
do this repeatedly, throwing off our bandwidth calculations.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 19:57 ` Jamie Lokier
@ 2004-06-22 23:13 ` Werner Almesberger
0 siblings, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-22 23:13 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Jens Axboe, linux-fsdevel
Jamie Lokier wrote:
> If there's a read which depends on a prior write, doesn't it make more
> sense to just copy the data in memory for the parts which overlap?
Probably. In this case, we'd have to fragment the read request
into the part that's handled through this copy operation, and
the 0-2 parts that need to come from disk.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 23:10 ` Werner Almesberger
@ 2004-06-23 0:14 ` Jamie Lokier
2004-06-23 6:27 ` Jens Axboe
1 sibling, 0 replies; 46+ messages in thread
From: Jamie Lokier @ 2004-06-23 0:14 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Jens Axboe, linux-fsdevel
Werner Almesberger wrote:
> > Bad idea, unless you have zero setup overhead for the hardware issued
> > commands. Linux will also attempt to remerge these requests when it
> > later discovers they are adjacent. You can block this by disallowing
> > merging of request with different priorities, but I really don't see why
> > you'd want to do that. It would be a net loss in the end anyways.
>
> The issue is that you may get large requests, in the middle of
> which a single page gets a higher priority, e.g. because the
> large request comes from a low-priority copy operation, and
> there's a high-priority reader concurrently working on the
> same file.
>
> In this case, the high-priority reader either has to wait for
> the whole low-priority request to crawl to the head of the queue
> (probably missing the deadline of the high-priority read), or we
> could take the request and raise its priority, giving our
> low-priority reader a nice boost. The latter isn't so bad if it
> happens every once in a while, but someone may figure out how to
> do this repeatedly, throwing off our bandwidth calculations.
That's fine for a device with fast data transfer and slow seek times.
But for a device with slow data transfer (e.g. nbd to a remote disk),
you'd want to split the request for sure.
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 23:10 ` Werner Almesberger
2004-06-23 0:14 ` Jamie Lokier
@ 2004-06-23 6:27 ` Jens Axboe
1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2004-06-23 6:27 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Jamie Lokier, linux-fsdevel
On Tue, Jun 22 2004, Werner Almesberger wrote:
> Jens Axboe wrote:
> > Overlapping requests are only detected if they start at the same
> > sector.
> >
> > The mechanism is just there because of the data structure use, [...]
>
> So even the special handling of requests that start with the
> same sector isn't required, and shouldn't be depended on, right ?
Correct, it's just a side effect of the rbtree. 2.4 never tried to do
anything about it.
> > Bad idea, unless you have zero setup overhead for the hardware issued
> > commands. Linux will also attempt to remerge these requests when it
> > later discovers they are adjacent. You can block this by disallowing
> > merging of request with different priorities, but I really don't see why
> > you'd want to do that. It would be a net loss in the end anyways.
>
> The issue is that you may get large requests, in the middle of
> which a single page gets a higher priority, e.g. because the
> large request comes from a low-priority copy operation, and
> there's a high-priority reader concurrently working on the
> same file.
>
> In this case, the high-priority reader either has to wait for
> the whole low-priority request to crawl to the head of the queue
> (probably missing the deadline of the high-priority read), or we
> could take the request and raise its priority, giving our
> low-priority reader a nice boost. The latter isn't so bad if it
> happens every once in a while, but someone may figure out how to
> do this repeatedly, throwing off our bandwidth calculations.
>
I see your point. Sounds like you have to be careful with request
allocations once a single request suddenly needs 2 more request slots
due to splitting (livelock country).
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-22 20:53 ` Jens Axboe
@ 2004-06-23 16:41 ` Bryan Henderson
2004-06-23 16:52 ` Jens Axboe
2004-06-23 16:53 ` Jamie Lokier
0 siblings, 2 replies; 46+ messages in thread
From: Bryan Henderson @ 2004-06-23 16:41 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jamie Lokier, linux-fsdevel, Werner Almesberger
>> Are you saying that if Sector A contains "foo", and I do a
>> __make_request(Write, Sector A, "bar") and then a __make_request(Read,
>> Sector A), the read might read "foo"? Assuming no barriers.
>
>Well no, even for direct io we maintain ordering if you hit an alias.
>...
>If you have overlapping requests, then you could get into serious
>trouble. You should not do that.
What is an alias, and how is it different from an overlap?
Also: The question is about the block layer, but you answer in terms of
direct I/O. Does __make_request() know what direct I/O is? I thought it
just did I/O for I/O's sake.
>The page cache will make sure don't see "foo", since you'll wait for the
>page to unlocked at the end of io.
If it's a page cache page, the point is moot because the elevator would
never see the sequence of requests in question. But my question is about
the general case, with no assumptions about who is calling
__make_request() and why.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-23 16:41 ` Bryan Henderson
@ 2004-06-23 16:52 ` Jens Axboe
2004-06-23 16:53 ` Jamie Lokier
1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2004-06-23 16:52 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jamie Lokier, linux-fsdevel, Werner Almesberger
On Wed, Jun 23 2004, Bryan Henderson wrote:
> >> Are you saying that if Sector A contains "foo", and I do a
> >> __make_request(Write, Sector A, "bar") and then a __make_request(Read,
> >> Sector A), the read might read "foo"? Assuming no barriers.
> >
> >Well no, even for direct io we maintain ordering if you hit an alias.
> >...
> >If you have overlapping requests, then you could get into serious
> >trouble. You should not do that.
>
> What is an alias, and how is it different from an overlap?
It's just an internal problem - when you add a request to the rbtree of
the io scheduler, and an existing request with that key (start location)
already exists.
> Also: The question is about the block layer, but you answer in terms
> of direct I/O. Does __make_request() know what direct I/O is? I
> thought it just did I/O for I/O's sake.
The block layer doesn't know and it doesn't care. I answer in terms of
direct IO since that's the only way you'll get a request issued for the
same sector. As mentioned below, the page cache prevents that from
happening.
> >The page cache will make sure don't see "foo", since you'll wait for the
> >page to unlocked at the end of io.
>
> If it's a page cache page, the point is moot because the elevator would
> never see the sequence of requests in question. But my question is about
Precisely.
> the general case, with no assumptions about who is calling
> __make_request() and why.
As stated earlier in this thread, the io scheduler makes no real attempt
to cope with this scenario. Users of direct io are expected to
syncronize themselves. It would be pretty silly not to do this, if only
for performance reasons. The fact that the io scheduler catches
"aliases" is not a feature and not something that is really useful in
this respect, since it's only a slight subset of the general problem of
overlapping ios.
--
Jens Axboe
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-23 16:41 ` Bryan Henderson
2004-06-23 16:52 ` Jens Axboe
@ 2004-06-23 16:53 ` Jamie Lokier
2004-06-23 21:08 ` Bryan Henderson
2004-06-23 23:23 ` Werner Almesberger
1 sibling, 2 replies; 46+ messages in thread
From: Jamie Lokier @ 2004-06-23 16:53 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jens Axboe, linux-fsdevel, Werner Almesberger
Bryan Henderson wrote:
> Also: The question is about the block layer, but you answer in terms of
> direct I/O. Does __make_request() know what direct I/O is? I thought it
> just did I/O for I/O's sake.
>
> >The page cache will make sure don't see "foo", since you'll wait for the
> >page to unlocked at the end of io.
>
> If it's a page cache page, the point is moot because the elevator would
> never see the sequence of requests in question. But my question is about
> the general case, with no assumptions about who is calling
> __make_request() and why.
This whole thread is asking a simple question: what shall we define
"I/O write barrier" to mean? Does it force an ordering for reads
after the barrier which overlap writes before the barrier, or not?
We are free to define the answer as "yes" or "no", according to
whichever is more useful. Or even to define two kinds of barrier, if
that would be useful. (E.g. I wonder if direct I/O to a file on a
journalled filesystem would need that).
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-23 16:53 ` Jamie Lokier
@ 2004-06-23 21:08 ` Bryan Henderson
2004-06-23 23:23 ` Werner Almesberger
1 sibling, 0 replies; 46+ messages in thread
From: Bryan Henderson @ 2004-06-23 21:08 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Jens Axboe, linux-fsdevel, Werner Almesberger
>This whole thread is asking a simple question: what shall we define
>"I/O write barrier" to mean? Does it force an ordering for reads
>after the barrier which overlap writes before the barrier, or not?
>
>We are free to define the answer as "yes" or "no", according to
>whichever is more useful. Or even to define two kinds of barrier, if
>that would be useful.
There was a small branch of the thread aimed at clarifying what kind of
ordering there is independent of barriers, which is crucial to deciding
what additional ordering a barrier should impose. The answer to that
subsidiary question appears to be "none" unless you count a side effect of
alias handling (and you shouldn't).
This is, incidentally, nonobvious. I believe in most contexts, when
people say "elevator algorithm," they're talking about something that
affects the timing of a stream of disk I/O requests, but not the outcome;
the outcome is expected to be the same as FIFO regardless of the
scheduling. Barriers come into play only when you have timing
dependencies between two sectors, e.g. when you don't want to wipe
something out of the journal until _after_ it has been written someplace
else. Jens points out that this ideal elevator isn't practical, or
necessary, in Linux.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-23 16:53 ` Jamie Lokier
2004-06-23 21:08 ` Bryan Henderson
@ 2004-06-23 23:23 ` Werner Almesberger
2004-06-24 13:43 ` Jamie Lokier
1 sibling, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-23 23:23 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, Jens Axboe, linux-fsdevel
Jamie Lokier wrote:
> This whole thread is asking a simple question: what shall we define
> "I/O write barrier" to mean? Does it force an ordering for reads
> after the barrier which overlap writes before the barrier, or not?
It's always the simple questions that are the hardest to answer ;-)
So far, we seem to have the following rules (at least for
"barrier-only" barriers; special requests come with their own
barriers, see below):
i) write requests may never cross a barrier that separates
them from other write requests
ii) a read request R and a write request W may change their
order (with respect to each other), unless they overlap
AND are separated by at least one barrier
iii) in all other cases, requests are free to move about
Rule ii) seems a little tricky. Is there actually a means for
user space to send a barrier, e.g. when doing direct IO ? If
not, then I think the only user of ii) would be "direct IO" that
comes from unbuffered file system meta-data and such.
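For illustration, the three rules above could be stated as a predicate
(a sketch only; the request representation here is made up, not the
kernel's struct request):

```python
from collections import namedtuple

# Hypothetical request: sector range [start, end) plus a type flag.
Request = namedtuple("Request", "start end is_write")

def overlap(r1, r2):
    # Two half-open sector ranges intersect.
    return r1.start < r2.end and r2.start < r1.end

def may_reorder(r1, r2, barrier_between):
    """May r1 and r2 swap positions in the queue?"""
    if r1.is_write and r2.is_write:
        # i) writes may never cross a barrier separating them
        return not barrier_between
    if r1.is_write != r2.is_write:
        # ii) read vs. write may swap unless they overlap AND are
        #     separated by at least one barrier
        return not (overlap(r1, r2) and barrier_between)
    # iii) all other cases (read vs. read) are free to move
    return True
```

A priority elevator could use such a predicate to decide whether
promoting a timing-critical read past queued writes is legal.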
> Or even to define two kinds of barrier, if that would be useful.
I can think of three distinct uses of barriers (but there may
be more):
1) ensure integrity involving multiple requests
2) disentangle overlapping requests
3) make sure special requests (power management or such) don't
wander off
2) may merit a distinction. E.g. an elevator that automatically
puts overlapping requests in FIFO order (i.e. Bryan's "ideal"
elevator) could ignore this kind of barrier. Barriers of type
3) restrict the movement of anything.
Cheers, Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
@ 2004-06-24 0:48 Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 0:48 UTC (permalink / raw)
To: linux-fsdevel
BTW, regarding overlapping requests, I wonder if there's a data
structure that gives O(log requests) or such lookups for ranges.
The best I could spontaneously think of would be
O(new_request_size*log(requests*avg_request_size))
which isn't pretty.
BTW2, is O_DIRECT actually a Linux-only thing, or is there some
ancestor whose semantics we may want to preserve ? I've had a
quick look at POSIX, but they don't seem to have direct IO.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 0:48 barriers vs. reads Werner Almesberger
@ 2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 2 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 3:39 UTC (permalink / raw)
To: linux-fsdevel
I wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.
Seems that I've found one that is maybe 2-4 times as expensive as a
single tree. It works as follows: if we have a set of ranges (a,b)
and want to see if any of them overlap with a range (x,y), we
compare the indices of the matches (or almost-matches).
num_overlaps = |{a : a < y}| - |{b : b <= x}|
"{a : a < y}" is "the set of all a where a < y". "|...|" is the
number of elements in a set.
We could obtain such indices by counting the number of nodes in
each branch of the tree. That's O(1) for all regular tree
operations, and O(log n) for the sum. The index is the size of
all the trees under left branches we haven't taken, plus the
number of nodes we've left through the right branch. If there are
multiple equal entries, we must find the first one.
One problem: I did this mostly by instinct. It seems to work
perfectly, but I can't quite explain why :-(
I put a dirty little program to simulate this on
http://abiss.sourceforge.net/t.tar.gz
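The same counting idea in a few lines (a standalone sketch, not the
program above; sorted endpoint lists plus binary search stand in for
the node-counted tree, so building them is O(n log n) rather than the
incremental O(log n) per operation you'd get from the real structure):

```python
from bisect import bisect_left, bisect_right

def count_overlaps(ranges, x, y):
    """Number of ranges (a, b) overlapping the query (x, y), via
    num_overlaps = |{a : a < y}| - |{b : b <= x}|.
    Touching endpoints (b == x) count as non-overlapping, matching
    the b <= x term."""
    starts = sorted(a for a, b in ranges)
    ends = sorted(b for a, b in ranges)
    # bisect_left counts elements strictly below y;
    # bisect_right counts elements at or below x.
    return bisect_left(starts, y) - bisect_right(ends, x)
```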
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
1 sibling, 1 reply; 46+ messages in thread
From: Herbert Poetzl @ 2004-06-24 8:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
On Thu, Jun 24, 2004 at 12:39:44AM -0300, Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).
>
> num_overlaps = |{a : a < y}| - |{b : b <= x}|
>
> "{a : a < y}" is "the set of all a where a < y". "|...|" is the
> number of elements in a set.
>
> We could obtain such indices by counting the number of nodes in
> each branch of the tree. That's O(1) for all regular tree
> operations, and O(log n) for the sum. The index is the size of
> all the trees under left branches we haven't taken, plus the
> number of nodes we've left through the right branch. If there are
> multiple equal entries, we must find the first one.
>
> One problem: I did this mostly by instinct. It seems to work
> perfectly, but I can't quite explain why :-(
hmm ... there are eight cases for how two ranges can interact ...
a b a b
| | | |
1) +-----+ +---------+ 2) +-----+ +---------+
| | | |
x y x y
a b a b
| | | |
3) +-------------+---------+ 4) +-------------+---------+
| | | |
x y x y
a b a b
| | | |
5) +-------------+===+-----+ 6) +-----+========+--------+
| | | |
x y x y
a b a b
| | | |
7) +-----+========+--------+ 8) +-----+========+--------+
| | | |
x y x y
by verifying those eight cases for correctness, you
can conclude that the sum over N such cases will give
the correct number of overlaps (with a given test
range); verification itself is simple:
case a<y b<=x |{a:a<y}| - |{b:b<= x}|
------+-------+-------+-------------------------
1) | YES | YES | 0
2) | NO | NO | 0
3) | YES | YES | 0
4) | NO | NO | 0
------+-------+-------+-----
5) | YES | NO | 1
6) | YES | NO | 1
7) | YES | NO | 1
8) | YES | NO | 1
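The case analysis can also be cross-checked by brute force over small
integer ranges (a throwaway sketch; `direct_count` encodes the same
convention as the table, overlap iff a < y and NOT b <= x):

```python
from itertools import combinations

def direct_count(ranges, x, y):
    # (a, b) overlaps (x, y) iff a < y and b > x
    return sum(1 for a, b in ranges if a < y and b > x)

def formula_count(ranges, x, y):
    # |{a : a < y}| - |{b : b <= x}|
    return (sum(1 for a, _ in ranges if a < y)
            - sum(1 for _, b in ranges if b <= x))

def exhaustive_check(limit=6):
    """Compare the formula against direct counting for every set of
    up to three intervals with endpoints below `limit`, against every
    possible query interval."""
    ivals = [(a, b) for a in range(limit) for b in range(a + 1, limit)]
    return all(formula_count(rs, x, y) == direct_count(rs, x, y)
               for n in (1, 2, 3)
               for rs in combinations(ivals, n)
               for x, y in ivals)
```

The per-range argument is exactly the table: an overlapping range
contributes 1 - 0, a range entirely left of x contributes 1 - 1, and a
range entirely right of y contributes 0 - 0, so the difference counts
overlaps exactly.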
HTH,
Herbert
> I put a dirty little program to simulate this on
> http://abiss.sourceforge.net/t.tar.gz
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 12:16 ` Werner Almesberger
0 siblings, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 12:16 UTC (permalink / raw)
To: linux-fsdevel
Herbert Poetzl wrote:
> by verifying those eight cases for correctness, you
> can conclude, that the sum of N such cases will give
> the correct number of overlaps (with a given test
> range); verification itself is simple:
Thanks (very nice drawings, btw.) ! Yes, now that I see it
written down, I also see how we can generalize that to a
proof for any number of (a,b) ranges. Cool !
So this means that we have a solution to detect overlaps
that shouldn't be significantly slower than, say, tree+hash
as used in the anticipatory scheduler. One problem is that
this approach doesn't tell us where the overlapping
requests are, only that they are somewhere out there.
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2004-06-24 13:36 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).
Is the prio_tree data structure, the one being used by recent VM work,
which keeps track of ranges, any use?
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-23 23:23 ` Werner Almesberger
@ 2004-06-24 13:43 ` Jamie Lokier
2004-06-24 14:32 ` Christoph Hellwig
2004-06-24 17:05 ` Werner Almesberger
0 siblings, 2 replies; 46+ messages in thread
From: Jamie Lokier @ 2004-06-24 13:43 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, Jens Axboe, linux-fsdevel
Werner Almesberger wrote:
> i) write requests may never cross a barrier that separates
> them from other write requests
> ii) a read request R and a write request W may change their
> order (with respect to each other), unless they overlap
> AND are separated by at least one barrier
> iii) in all other cases, requests are free to move about
>
> Rule ii) seems a little tricky. Is there actually a means for
> user space to send a barrier, e.g. when doing direct IO ? If
> not, the I think only user of ii) would be "direct IO" that
> comes from unbuffered file system meta-data and such.
It should be possible to write a decent filesystem in userspace --
even if it's just for prototyping -- so barriers from userspace should
be offered, eventually.
Note that unbuffered file system meta-data can theoretically overlap
page cache, although it usually doesn't, and a filesystem could easily
set a flag to say that it doesn't ever.
Think of the experimental modifications to ext3 to allow inodes to be
allocated in a file. That means blocks are changing their meaning
from being file blocks to inode blocks or vice versa -- and both I/Os
could conceivably be in flight at the same time.
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 13:43 ` Jamie Lokier
@ 2004-06-24 14:32 ` Christoph Hellwig
2004-06-24 17:05 ` Werner Almesberger
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2004-06-24 14:32 UTC (permalink / raw)
To: Jamie Lokier
Cc: Werner Almesberger, Bryan Henderson, Jens Axboe, linux-fsdevel
On Thu, Jun 24, 2004 at 02:43:07PM +0100, Jamie Lokier wrote:
> Think of the experimental modifications to ext3 to allow inodes to be
> allocated in a file.
A long time ago, Kristian Köhntopp did this for ext2 in his Diploma Thesis [1]
[1] http://kris.koehntopp.de/artikel/diplom/ (Warning: German)
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 0 replies; 46+ messages in thread
From: Steve Lord @ 2004-06-24 16:39 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
Werner Almesberger wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.
> The best I could spontaneously think of would be
> O(new_request_size*log(requests*avg_request_size))
> which isn't pretty.
>
> BTW2, is O_DIRECT actually a Linux-only thing, or is there some
> ancestor whose semantics we may want to preserve ? I've had a
> quick look at POSIX, but they don't seem to have direct IO.
Irix has O_DIRECT, Solaris has something too, but it is not
in the POSIX specs. Cray Unicos is the oldest implementation I came
across.
Irix explicitly lets multiple readers and writers into a file
at once with O_DIRECT. The assumption being that the application
which does this is doing its own coordination and will not
shoot itself in the foot.
Steve Lord
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
@ 2004-06-24 17:00 ` Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2 siblings, 1 reply; 46+ messages in thread
From: Bryan Henderson @ 2004-06-24 17:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>BTW2, is O_DIRECT actually a Linux-only thing, or is there some
>ancestor whose semantics we may want to preserve ? I've had a
>quick look at POSIX, but they don't seem to have direct IO.
Several operating systems got O_DIRECT before, but practically at the same
time as, Linux. There is no standard and no agreement on the small
details, such as what to do when someone tries to read the last
half-sector of a file, given that direct I/O is supposed to be in whole
sectors.
It seems obvious to me that whatever ordering guarantees the user gets
without the O_DIRECT flag, he should get with it as well. The user
doesn't see disk queues and sectors and elevators and file caches. He
sees byte-stream files.
If we can give much better performance by withdrawing those guarantees for
O_DIRECT, though, users would probably accept it since there is in fact no
strong legacy we have to live up to.
When I worry about I/O scheduling, I usually worry a lot more about block
device direct I/O (raw devices) than file direct I/O. The former is where
the block layer is most exposed to the user.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 13:36 ` Jamie Lokier
@ 2004-06-24 17:02 ` Werner Almesberger
0 siblings, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:02 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel
Jamie Lokier wrote:
> Is the prio_tree data structure, the one being used by recent VM work,
> which keeps track of ranges, any use?
Hmm, they deliver exactly the data we need. That's great. Unlike
rb-trees, though, they aren't balanced, and thus have a worst-case
O(log n) term with "log n" = sizeof(sector_t)*8 for simple
lookups. Furthermore, they have a scary worst-case insertion
time of O((log n)^2).
With rb-trees, we get a mere O(log nr_requests) for lookups. So
that's A*32 vs. B*7 using all-default 2.6.7 on ia32. Insertion
has the same typical cost, and A*1024 vs. B*7 worst-case. A and B
are implementation-specific per-operation cost factors.
But yes, they look very interesting ...
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads
2004-06-24 13:43 ` Jamie Lokier
2004-06-24 14:32 ` Christoph Hellwig
@ 2004-06-24 17:05 ` Werner Almesberger
1 sibling, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:05 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, Jens Axboe, linux-fsdevel
Jamie Lokier wrote:
> It should be possible to write a decent filesystem in userspace --
> even if it's just for prototyping -- so barriers from userspace should
> be offered, eventually.
Yes, I suppose also databases would want barriers. My question
was more in the direction of "what might we break if we change
some of the rules now ?".
> Think of the experimental modifications to ext3 to allow inodes to be
> allocated in a file. That means blocks are changing their meaning
> from being file blocks to inode blocks or vice versa -- and both I/Os
> could conceivably be in flight at the same time.
Oh, that one's evil !
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
@ 2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-25 0:11 ` Bryan Henderson
0 siblings, 2 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:46 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel
Bryan Henderson wrote:
> It seems obvious to me that whatever ordering guarantees the user gets
> without the O_DIRECT flag, he should get with it as well.
Yes, it would be nice if we could obtain such behaviour without
unacceptable performance sacrifices. It seems to me that, if we
can find an efficient way for serializing all write-write and
read-write overlaps, plus have explicit barriers for serializing
non-overlapping writes, this should yield pretty much what
everyone wants (*). Now, that "if" needs a bit of work ... :-)
(*) The only difference being that a completing read doesn't
tell you whether the elevator has already passed a barrier.
Currently, one could be lured into depending on this.
> When I worry about I/O scheduling, I usually worry a lot more about block
> device direct I/O (raw devices) than file direct I/O. The former is where
> the block layer is most exposed to the user.
I haven't looked at the direct IO code in detail, but it seems
to me that the elevator behaviour affects file-based direct IO
in the same way as the device-based one.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
@ 2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
2004-06-25 0:11 ` Bryan Henderson
1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2004-06-24 18:50 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel
Werner Almesberger wrote:
> Bryan Henderson wrote:
> > It seems obvious to me that whatever ordering guarantees the user gets
> > without the O_DIRECT flag, he should get with it as well.
>
> Yes, it would be nice if we could obtain such behaviour without
> unacceptable performance sacrifices. It seems to me that, if we
> can find an efficient way for serializing all write-write and
> read-write overlaps, plus have explicit barriers for serializing
> non-overlapping writes, this should yield pretty much what
> everyone wants (*). Now, that "if" needs a bit of work ... :-)
Note that what filesystems and databases want is write-write *partial
dependencies*. The per-device I/O barrier is just a crude
approximation.
1. Think about this: two filesystems on different partitions of the same
device. The writes of each filesystem are independent, yet the
barriers will force the writes of one filesystem to come before
later-queued writes of the other.
2. Or, two database back-ends doing direct I/O to two separate files.
It's probably not a big performance penalty, but it illustrates that
the barriers are "bigger" than they need to be. Worth taking into
account when deciding what minimal ordering everyone _really_ wants.
If you do implement overlap detection logic, then would giving
barriers an I/O range be helpful? E.g. corresponding to partitions.
Here's a few more cases, which may not be quite right even now:
3. What if a journal is on a different device to its filesystem?
Ideally, write barriers between the different device queues would be
appropriate.
4. A journalling filesystem mounted on a loopback device. Is this
reliable now?
5. A journalling filesystem mounted on two loopback devices -- one for
the fs, one for the journal.
> (*) The only difference being that a completing read doesn't
> tell you whether the elevator has already passed a barrier.
> Currently, one could be lured into depending on this.
Isn't the barrier itself an I/O operation which can be waited on?
I agree something could depend on the reads at the moment.
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
0 siblings, 1 reply; 46+ messages in thread
From: Werner Almesberger @ 2004-06-24 20:55 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel
Jamie Lokier wrote:
> Note that what filesystems and databases want is write-write *partial
> dependencies*. The per-device I/O barrier is just a crude
> approximation.
True ;-) So what would an ideally flexible model look like ?
Partial order ? Triggers plus virtual requests ? There's also
the little issue that this should still yield an interface
that people can understand without taking a semester of
graph theory ;-)
> 3. What if a journal is on a different device to its filesystem?
"Don't do this" comes to mind :-)
> Isn't the barrier itself an I/O operation which can be waited on?
> I agree something could depend on the reads at the moment.
Making barriers waitable might be very useful, yes. That could
also be a step towards implementing those cross-device barriers.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 20:55 ` Werner Almesberger
@ 2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
0 siblings, 2 replies; 46+ messages in thread
From: Jamie Lokier @ 2004-06-24 22:42 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel
Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)
For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.
For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.
(When data is not journalled, the situation can be more complicated
because to avoid revealing secure data, you might require non-journal
data to be committed before allowing a journal write which increases
the file length or block mapping metadata. So then you have
non-journal writes before journal writes before other non-journal
writes. I'm not sure if ext3 or reiserfs do this).
You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can
imagine many fs updates are aggregated, into a larger transaction.
That'd aggregate into fewer barriers.
Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- and reducing the number of I/O barriers which
reach the device.
That's a simple consequence of barriers being a partial order. If you
have two parallel transactions:
A then (barrier) B
C then (barrier) D
It's ok to schedule those as:
A, C then (barrier) B, D
This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE
barriers implemented as cache flushes. Therefore scheduling I/Os in
this way is a real optimisation. In that example, it reduces 6 IDE
transactions to 5.
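As a toy restatement of that merge (labels are opaque write
identifiers; a real queue would carry before/after sets per barrier
entry rather than whole transactions):

```python
def merge_transactions(txns):
    """Merge independent transactions, each a (before_writes,
    after_writes) pair separated by one barrier, into a single
    schedule using just one barrier while preserving every
    per-transaction ordering."""
    schedule = []
    for before, _ in txns:
        schedule.extend(before)     # all first halves, any order
    schedule.append("BARRIER")      # one shared ordering point
    for _, after in txns:
        schedule.extend(after)      # all second halves
    return schedule
```

For two transactions this turns A|B and C|D into A, C | B, D: four
writes and one barrier instead of four writes and two barriers.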
This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.
So that begs a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but some
of them don't need to reach the device as actual commands, i.e. with
IDE that would allow some cache flush commands to be omitted.
I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to after, that is enough to do this kind of optimisation in the
I/O scheduler.
It would be nice to come up with an interface that the loopback device
can support and relay through the underlying fs.
> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)
ext3 and reiserfs both offered this from the beginning, so it's
important to someone. The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a faster
device than the main filesystem -- faster in part because it's linear
writing.
> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.
For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes are completed.
The same is true of a database. So, waiting on barriers isn't
strictly needed for any application which knows which writes it has
queued before the barrier.
fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in order.
That's the difference between a cache flush and an ordering command to
some I/O devices. PATA uses cache flush commands for ordering so both
barrier types are implemented the same. I'm not sure if there are
disks which allow ordering commands without immediately committing to
storage. Are there?
-- Jamie
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
1 sibling, 1 reply; 46+ messages in thread
From: Bryan Henderson @ 2004-06-25 0:11 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> It seems obvious to me that whatever ordering guarantees the user gets
>> without the O_DIRECT flag, he should get with it as well.
>
>Yes, it would be nice if we could obtain such behaviour without
>unacceptable performance sacrifices. It seems to me that, if we
>can find an efficient way for serializing all write-write and
>read-write overlaps, plus have explicit barriers for serializing
>non-overlapping writes, this should yield pretty much what
>everyone wants
Maybe it's time to identify a specific scenario we're trying to make
right. Because if the goal is for reads and writes with O_DIRECT to
interact the same way as without, then I don't see where the block layer
even comes into it -- the filesystem layer does the job.
There are two cases: atomic writes and nonatomic writes. Some filesystems
do atomic writes, which is what everyone seems to expect but isn't
actually stated in POSIX, and others don't. I heard Linux filesystems
based on generic_file_read/write don't, but I haven't looked closely at
those.
_With_ atomic writes, the filesystem code must use a lock to ensure a
whole write() finishes before another write() or read() begins. (If it's
smart, it can do this only where there are overlaps of file offsets, but I
don't think that's common). In the case of cached I/O, "finished" means
all the data is in the cache; with direct I/O, it means it's all on the
disk.
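The locking Bryan describes can be sketched with a toy in-memory file. This is an illustration of the atomic-write case only (one coarse lock; a real filesystem could lock byte ranges instead):

```python
import threading

class AtomicFile:
    """Toy in-memory file: write() is atomic with respect to read(),
    as in the 'atomic writes' case -- a reader never sees part of a
    write in progress."""
    def __init__(self, size):
        self._data = bytearray(size)
        self._lock = threading.Lock()  # whole-file lock; range locks would
                                       # serialize only overlapping I/O

    def write(self, offset, buf):
        with self._lock:               # whole write finishes before any read
            self._data[offset:offset + len(buf)] = buf

    def read(self, offset, length):
        with self._lock:
            return bytes(self._data[offset:offset + length])
```

For cached I/O, "finishes" means the data is in the cache before the lock drops; for direct I/O, the lock would be held until the data is on disk.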
_Without_ atomic writes, the caller of read() and write() is responsible
for synchronization. There simply isn't any guarantee that if Process A
does a read while Process B is doing a write that A won't see just part of
B's write. So if you're brave enough to have multiple processes
simultaneously reading and writing the same file regions, the processes
have to talk to each other.
There's also (kernel) aio to consider. I can see barriers playing a role
there. But I can't see how barriers figure into ordinary O_DIRECT file
I/O.
* Re: barriers vs. reads - O_DIRECT
2004-06-25 0:11 ` Bryan Henderson
@ 2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
0 siblings, 2 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-25 2:42 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel
Bryan Henderson wrote:
> There's also (kernel) aio to consider.
I suppose this would be the main scenario. AIO+O_DIRECT also
gives the closest approximation to directly talking to the
elevator.
> But I can't see how barriers figure into ordinary O_DIRECT file I/O.
Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
For reasonable semantics, I guess it would either have to block
until the operation has completed, or be "atomic" (*).
(*) Defining atomic can be tricky, too. E.g. see
http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.0/1361.html
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
* Re: barriers vs. reads - O_DIRECT
2004-06-24 22:42 ` Jamie Lokier
@ 2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
1 sibling, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-25 3:21 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel
Jamie Lokier wrote:
> For that purpose, a barrier has a set of writes which must come before
> it, and a set of writes which must come after. These represent a
> transaction set.
Okay, partial order then. AFAIK, this doesn't allow us to
express things like "do A after one of { B, C } is done".
That could be useful for redundant storage structures, but
we may not care.
> A then (barrier) B
> C then (barrier) D
>
> It's ok to schedule those as:
>
> A, B then (barrier) C, D
I think you mean "A, C then (barrier) B, D" :-)
There's a problem, though: the first has the following relations:
A < B, C < D. The second has: A < B, A < D, C < B, C < D. So now
the elevator needs to decide if avoiding the cost of the
side-effects of a barrier (cache-flush, or such) is acceptable,
given the cost of the additional ordering restrictions. (I'm
carefully trying to avoid saying "has less cost", because the
elevator may not always pick the "cheapest" variant.)
> It would be nice to come up with an interface that the loopback device
> can support and relay through the underlying fs.
You really like those loopback devices, don't you ? :-)
> ext3 and reiserfs both offered this from the beginning, so it's
> important to someone.
Sigh, yes ...
> committed prerequisite writes to stable storage. At other times, a
> barrier is there only to preserve ordering so that a journal
> functions, but it's not required that the data is actually committed
> to storage immediately -- merely that it _will_ be committed in order.
Well, to implement cross-device barriers, you need a means to
find out if the prerequisites for the local device to cross a
barrier are satisfied, and to make the local device wait until
then.
This may be implemented in a completely different thread/whatever
than the actual IO.
Making an elevator non-work-conserving is an interesting
exercise by itself. (It may be feasible: I've done it for a power
management experiment (*), and this works at least for PATA.)
(*) To stop, I simply make next_req return NULL. queue_empty
still returns the correct result. To resume, I have an
external trigger that calls blk_start_queue.
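The stop/resume trick in the footnote can be sketched as a pausable queue. A minimal sketch with hypothetical names, not the real elevator API:

```python
from collections import deque

class PausableElevator:
    """Non-work-conserving sketch: while stopped, next_req() returns
    None so the queue looks idle to the driver, even though
    queue_empty() still reports pending work; an external trigger
    resumes dispatch, as with blk_start_queue()."""
    def __init__(self):
        self._q = deque()
        self._stopped = False

    def add_req(self, rq):
        self._q.append(rq)

    def queue_empty(self):
        return not self._q          # still returns the correct result

    def next_req(self):
        if self._stopped or not self._q:
            return None             # withhold work while stopped
        return self._q.popleft()

    def stop(self):
        self._stopped = True

    def start(self):
        self._stopped = False       # the external resume trigger
```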
There's of course the question whether the elevator is really the
right place for all this. Perhaps just embedding a callback in a
barrier request could be used for a more general solution - in
particular one that would allow us to avoid having to solve all
problems at once :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
* RE: barriers vs. reads - O_DIRECT
2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
@ 2004-06-25 3:57 ` Guy
2004-06-25 4:52 ` Werner Almesberger
1 sibling, 1 reply; 46+ messages in thread
From: Guy @ 2004-06-25 3:57 UTC (permalink / raw)
To: 'Jamie Lokier', 'Werner Almesberger'
Cc: 'Bryan Henderson', linux-fsdevel
What if a filesystem is on a software RAID5 array? How does the filesystem
use a barrier when some of the writes are on different disks? Would md pass
the barrier down to each disk?
-----Original Message-----
From: linux-fsdevel-owner@vger.kernel.org
[mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Jamie Lokier
Sent: Thursday, June 24, 2004 6:43 PM
To: Werner Almesberger
Cc: Bryan Henderson; linux-fsdevel@vger.kernel.org
Subject: Re: barriers vs. reads - O_DIRECT
Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)
For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.
For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.
(When data is not journalled, the situation can be more complicated
because to avoid revealing secure data, you might require non-journal
data to be committed before allowing a journal write which increases
the file length or block mapping metadata. So then you have
non-journal writes before journal writes before other non-journal
writes. I'm not sure if ext3 or reiserfs do this).
You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can
imagine many fs updates aggregated into a larger transaction.
That'd aggregate into fewer barriers.
Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- and reducing the number of I/O barriers which
reach the device.
That's a simple consequence of barriers being a partial order. If you
have two parallel transactions:
A then (barrier) B
C then (barrier) D
It's ok to schedule those as:
A, B then (barrier) C, D
This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE
barriers implemented as cache flushes. Therefore scheduling I/Os in
this way is a real optimisation. In that example, it reduces 6 IDE
transactions to 5.
This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.
So that begs a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but some
of them don't need to reach the device as actual commands, i.e. with
IDE that would allow some cache flush commands to be omitted.
I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
appended to afterwards, that is enough to do this kind of optimisation in the
I/O scheduler.
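The merge described above can be sketched concretely. A toy scheduler, assuming each transaction is just a (before-set, after-set) pair of write names; this collapses a batch of transactions onto one physical barrier, which is the "A, C then (barrier) B, D" schedule:

```python
# Each barrier carries a "before" set and an "after" set of writes.
# Merging two transactions A<B and C<D lets one flush order both,
# turning 6 device commands (A, flush, B, C, flush, D) into 5.

def schedule(transactions):
    """transactions: list of (before_writes, after_writes) pairs.
    Returns a flat command list with a single 'FLUSH' ordering all
    before-sets ahead of all after-sets -- a coarser partial order
    that still satisfies every per-transaction constraint."""
    cmds = []
    for before, _ in transactions:
        cmds.extend(before)        # every prerequisite write first
    cmds.append("FLUSH")           # one physical barrier for the batch
    for _, after in transactions:
        cmds.extend(after)         # every dependent write after it
    return cmds
```

The cost, as noted later in the thread, is the extra ordering constraints the merge imposes (A < D and C < B were not required by either transaction).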
It would be nice to come up with an interface that the loopback device
can support and relay through the underlying fs.
> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)
ext3 and reiserfs both offered this from the beginning, so it's
important to someone. The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a faster
device than the main filesystem -- faster in part because it's linear
writing.
> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.
For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes are completed.
The same is true of a database. So, waiting on barriers isn't
strictly needed for any application which knows which writes it has
queued before the barrier.
fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in order.
That's the difference between a cache flush and an ordering command to
some I/O devices. PATA uses cache flush commands for ordering so both
barrier types are implemented the same. I'm not sure if there are
disks which allow ordering commands without immediately committing to
storage. Are there?
-- Jamie
* Re: barriers vs. reads - O_DIRECT
2004-06-25 3:57 ` Guy
@ 2004-06-25 4:52 ` Werner Almesberger
0 siblings, 0 replies; 46+ messages in thread
From: Werner Almesberger @ 2004-06-25 4:52 UTC (permalink / raw)
To: Guy; +Cc: 'Jamie Lokier', 'Bryan Henderson', linux-fsdevel
Guy wrote:
> What if a filesystem is on a software RAID5 array? How does the filesystem
> use a barrier when some of the writes are on different disks? Would md pass
> the barrier down to each disk?
Let's assume we just have a FIFO elevator per device, barriers
affect all requests, and we don't allow any reordering across
barriers. Then, a device would have to stop IO when it hits a
barrier, and could only resume when all other devices using
this barrier have reached it.
md would have to send the barrier to each device, with the
exception that in any sequence of barriers without other
requests, only the first and the last barrier are needed.
Now, if we have a smarter elevator and smarter barriers,
there are more tricks that can be played. E.g. an elevator
could try to reach a barrier shared by many others as soon
as possible, and then schedule requests allowed to cross
the barrier while waiting for the other devices.
I'm not sure how sophisticated we want to get in the
multiple devices case, though.
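The FIFO case above can be sketched as a shared rendezvous object. Hypothetical names, not the md interface:

```python
class SharedBarrier:
    """Cross-device barrier for the simple FIFO case: a device that
    hits the barrier stalls, and may only resume once every
    participating device has also reached it."""
    def __init__(self, n_devices):
        self.waiting_on = n_devices

    def arrive(self):
        """Called when a device's queue reaches the barrier.
        Returns True once all devices have arrived, i.e. every
        device may now resume dispatching requests."""
        self.waiting_on -= 1
        return self.waiting_on == 0
```

md would hand the same SharedBarrier to each member disk's queue; a smarter elevator could race to its arrive() call and then, while waiting, schedule any requests permitted to cross the barrier.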
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
* Re: barriers vs. reads - O_DIRECT aio
2004-06-25 2:42 ` Werner Almesberger
@ 2004-06-25 15:59 ` Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
1 sibling, 0 replies; 46+ messages in thread
From: Bryan Henderson @ 2004-06-25 15:59 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> There's also (kernel) aio to consider.
>
>I suppose this would be the main scenario. AIO+O_DIRECT also
>gives the closest approximation to directly talking to the
>elevator.
I agree. And it would be good for the aio interface to have all the
functionality of a block device's make_request_fn, including whatever
barrier function it winds up with.
* Re: barriers vs. reads - O_DIRECT
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
@ 2004-06-25 16:31 ` Bryan Henderson
1 sibling, 0 replies; 46+ messages in thread
From: Bryan Henderson @ 2004-06-25 16:31 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel
>> But I can't see how barriers figure into ordinary O_DIRECT file I/O.
>
>Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
>For reasonable semantics, I guess it would either have to block
>until the operation has completed, or be "atomic"
I think this part is well established. When an O_DIRECT write() returns,
the data has to have been written to the disk, ergo the process has to
block until all relevant I/O has completed. Remember that a primary
purpose of direct I/O is to allow multiple systems to access the same disk
from different ports. It has to be possible for a system to send a
message to another system saying, "I just wrote my data on disk Q. Have a
look."
_In addition_ to that requirement, the filesystem may or may not make that
write() atomic as viewed by other processes. (And as you point out, there
are multiple kinds of atomicity it might choose). I don't believe it
would make use of a block layer barrier in either case.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
end of thread, other threads:[~2004-06-25 16:31 UTC | newest]
Thread overview: 46+ messages
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
2004-06-25 4:52 ` Werner Almesberger
2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
-- strict thread matches above, loose matches on Subject: below --
2004-06-22 3:53 barriers vs. reads Werner Almesberger
2004-06-22 7:39 ` Jens Axboe
2004-06-22 7:50 ` Werner Almesberger
2004-06-22 7:55 ` Jens Axboe
2004-06-22 8:34 ` Werner Almesberger
2004-06-22 10:08 ` Jens Axboe
2004-06-22 11:28 ` Jamie Lokier
2004-06-22 11:32 ` Jens Axboe
2004-06-22 17:12 ` Bryan Henderson
2004-06-22 20:53 ` Jens Axboe
2004-06-23 16:41 ` Bryan Henderson
2004-06-23 16:52 ` Jens Axboe
2004-06-23 16:53 ` Jamie Lokier
2004-06-23 21:08 ` Bryan Henderson
2004-06-23 23:23 ` Werner Almesberger
2004-06-24 13:43 ` Jamie Lokier
2004-06-24 14:32 ` Christoph Hellwig
2004-06-24 17:05 ` Werner Almesberger
2004-06-22 18:53 ` Werner Almesberger
2004-06-22 19:57 ` Jamie Lokier
2004-06-22 23:13 ` Werner Almesberger
2004-06-22 20:57 ` Jens Axboe
2004-06-22 23:10 ` Werner Almesberger
2004-06-23 0:14 ` Jamie Lokier
2004-06-23 6:27 ` Jens Axboe
2004-06-22 18:45 ` Werner Almesberger
2004-06-22 19:07 ` Guy