* Re: barriers vs. reads
@ 2004-06-24 0:48 Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 0:48 UTC (permalink / raw)
To: linux-fsdevel
BTW, regarding overlapping requests, I wonder if there's a data
structure that gives O(log requests) or such lookups for ranges.
The best I could spontaneously think of would be
O(new_request_size*log(requests*avg_request_size))
which isn't pretty.
BTW2, is O_DIRECT actually a Linux-only thing, or is there some
ancestor whose semantics we may want to preserve ? I've had a
quick look at POSIX, but they don't seem to have direct IO.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: barriers vs. reads
2004-06-24  0:48 barriers vs. reads Werner Almesberger
@ 2004-06-24  3:39 ` Werner Almesberger
2004-06-24  8:00 ` Herbert Poetzl
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 3:39 UTC (permalink / raw)
To: linux-fsdevel

I wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.

Seems that I've found one that is maybe 2-4 times as expensive as a
single tree. It works as follows: if we have a set of ranges (a,b)
and want to see if any of them overlap with a range (x,y), we
compare the indices of the matches (or almost-matches).

num_overlaps = |{a : a < y}| - |{b : b <= x}|

"{a : a < y}" is "the set of all a where a < y". "|...|" is the
number of elements in a set.

We could obtain such indices by counting the number of nodes in
each branch of the tree. That's O(1) for all regular tree
operations, and O(log n) for the sum. The index is the size of
all the trees under left branches we haven't taken, plus the
number of nodes we've left through the right branch. If there are
multiple equal entries, we must find the first one.

One problem: I did this mostly by instinct. It seems to work
perfectly, but I can't quite explain why :-(

I put a dirty little program to simulate this on
http://abiss.sourceforge.net/t.tar.gz

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
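
To make the counting idea concrete, here is a small standalone sketch. It is
not the program from t.tar.gz, and every name in it is made up; plain sorted
arrays with linear scans stand in for what would really be rb-trees augmented
with per-node subtree counts, which is what turns the two rank queries into
O(log n) operations. Ranges are read as half-open, so [a,b) overlaps [x,y)
iff a < y and b > x.

/*
 * Illustrative sketch of num_overlaps = |{a : a < y}| - |{b : b <= x}|.
 * Sorted arrays + linear scans substitute for augmented rb-trees.
 */
#include <stdio.h>

static int rank_lt(const unsigned *v, int n, unsigned key)
{
	int i, r = 0;			/* |{ v[i] : v[i] < key }| */

	for (i = 0; i < n; i++)
		if (v[i] < key)
			r++;
	return r;
}

static int rank_le(const unsigned *v, int n, unsigned key)
{
	int i, r = 0;			/* |{ v[i] : v[i] <= key }| */

	for (i = 0; i < n; i++)
		if (v[i] <= key)
			r++;
	return r;
}

int main(void)
{
	/* three queued requests: [0,4), [4,8), [10,12) */
	unsigned start[] = { 0, 4, 10 };
	unsigned end[]   = { 4, 8, 12 };
	unsigned x = 3, y = 11;		/* new request [3,11) */
	int overlaps;

	overlaps = rank_lt(start, 3, y) - rank_le(end, 3, x);
	printf("overlapping requests: %d\n", overlaps);	/* prints 3 */
	return 0;
}

In the tree version, each rank query walks once from the root to a leaf,
adding up the node counts of the left subtrees it skips, which is where the
O(log requests) bound comes from.
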
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
1 sibling, 1 reply; 19+ messages in thread
From: Herbert Poetzl @ 2004-06-24 8:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

On Thu, Jun 24, 2004 at 12:39:44AM -0300, Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).
>
> num_overlaps = |{a : a < y}| - |{b : b <= x}|
>
> "{a : a < y}" is "the set of all a where a < y". "|...|" is the
> number of elements in a set.
>
> We could obtain such indices by counting the number of nodes in
> each branch of the tree. That's O(1) for all regular tree
> operations, and O(log n) for the sum. The index is the size of
> all the trees under left branches we haven't taken, plus the
> number of nodes we've left through the right branch. If there are
> multiple equal entries, we must find the first one.
>
> One problem: I did this mostly by instinct. It seems to work
> perfectly, but I can't quite explain why :-(

hmm ... there are eight cases of how ranges can interact ...

      a     b
      |     |
 1)   +-----+    +---------+
                 |         |
                 x         y

                 a         b
                 |         |
 2)   +-----+    +---------+
      |     |
      x     y

      a             b
      |             |
 3)   +-------------+---------+
                    |         |
                    x         y

                    a         b
                    |         |
 4)   +-------------+---------+
      |             |
      x             y

      a                 b
      |                 |
 5)   +-------------+===+-----+
                    |         |
                    x         y

            a                 b
            |                 |
 6)   +-----+========+--------+
      |              |
      x              y

            a        b
            |        |
 7)   +-----+========+--------+
      |                       |
      x                       y

      a                       b
      |                       |
 8)   +-----+========+--------+
            |        |
            x        y

by verifying those eight cases for correctness, you
can conclude that the sum of N such cases will give
the correct number of overlaps (with a given test
range); verification itself is simple:

  case |  a<y  | b<=x  | |{a:a<y}| - |{b:b<=x}|
 ------+-------+-------+-------------------------
   1)  |  YES  |  YES  |            0
   2)  |  NO   |  NO   |            0
   3)  |  YES  |  YES  |            0
   4)  |  NO   |  NO   |            0
 ------+-------+-------+-----
   5)  |  YES  |  NO   |            1
   6)  |  YES  |  NO   |            1
   7)  |  YES  |  NO   |            1
   8)  |  YES  |  NO   |            1

HTH,
Herbert

> I put a dirty little program to simulate this on
> http://abiss.sourceforge.net/t.tar.gz
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread
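
The same verification can also be done mechanically. A throwaway sketch,
assuming the half-open reading of the ranges ([a,b) overlaps [x,y) iff
a < y and b > x); summing the per-range indicator over all queued ranges
is exactly the num_overlaps expression above.

/*
 * Brute-force check of the per-range indicator against the direct
 * overlap test for every small pair of valid ranges.  Illustrative only.
 */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	int a, b, x, y;

	for (a = 0; a < 8; a++)
		for (b = a + 1; b <= 8; b++)
			for (x = 0; x < 8; x++)
				for (y = x + 1; y <= 8; y++)
					assert(((a < y) - (b <= x)) ==
					       ((a < y) && (b > x)));

	printf("indicator matches the overlap test in all cases\n");
	return 0;
}
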
* Re: barriers vs. reads
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 12:16 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 12:16 UTC (permalink / raw)
To: linux-fsdevel

Herbert Poetzl wrote:
> by verifying those eight cases for correctness, you
> can conclude that the sum of N such cases will give
> the correct number of overlaps (with a given test
> range); verification itself is simple:

Thanks (very nice drawings, btw.) ! Yes, now that I see it written
down, I also see how we can generalize that to a proof for any
number of (a,b) ranges. Cool !

So this means that we have a solution to detect overlaps that
shouldn't be significantly slower than, say, tree+hash as used in
the anticipatory scheduler.

One problem is that this approach doesn't tell us where the
overlapping requests are, only that they are somewhere out there.

Thanks,
- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
@ 2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 13:36 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

Werner Almesberger wrote:
> I wrote:
> > BTW, regarding overlapping requests, I wonder if there's a data
> > structure that gives O(log requests) or such lookups for ranges.
>
> Seems that I've found one that is maybe 2-4 times as expensive as a
> single tree. It works as follows: if we have a set of ranges (a,b)
> and want to see if any of them overlap with a range (x,y), we
> compare the indices of the matches (or almost-matches).

Is the prio_tree data structure, the one being used by recent VM work,
which keeps track of ranges, any use?

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 13:36 ` Jamie Lokier
@ 2004-06-24 17:02 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:02 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-fsdevel

Jamie Lokier wrote:
> Is the prio_tree data structure, the one being used by recent VM work,
> which keeps track of ranges, any use?

Hmm, they deliver exactly the data we need. That's great.

Unlike rb-trees, they also aren't balanced, and thus have a
worst-case O(log n) term with "log n" = sizeof(sector_t)*8 for
simple lookups. Furthermore, they have a scary worst-case insertion
time of O((log n)^2).

With rb-trees, we get a mere O(log nr_requests) for lookups.

So that's A*32 vs. B*7 using all-default 2.6.7 on ia32. Insertion
has the same typical cost, and A*1024 vs. B*7 worst-case. A and B
are implementation-specific per-operation cost factors.

But yes, they look very interesting ...

Thanks,
- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
@ 2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2 siblings, 0 replies; 19+ messages in thread
From: Steve Lord @ 2004-06-24 16:39 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

Werner Almesberger wrote:
> BTW, regarding overlapping requests, I wonder if there's a data
> structure that gives O(log requests) or such lookups for ranges.
> The best I could spontaneously think of would be
> O(new_request_size*log(requests*avg_request_size))
> which isn't pretty.
>
> BTW2, is O_DIRECT actually a Linux-only thing, or is there some
> ancestor whose semantics we may want to preserve ? I've had a
> quick look at POSIX, but they don't seem to have direct IO.

Irix has O_DIRECT, Solaris has something too, but it is not in the
POSIX specs. Cray Unicos is the oldest implementation I came across.

Irix explicitly lets multiple readers and writers into a file at once
with O_DIRECT. The assumption being that the application which does
this is doing its own coordination and will not shoot itself in the
foot.

Steve Lord

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
@ 2004-06-24 17:00 ` Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2004-06-24 17:00 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

>BTW2, is O_DIRECT actually a Linux-only thing, or is there some
>ancestor whose semantics we may want to preserve ? I've had a
>quick look at POSIX, but they don't seem to have direct IO.

Several operating systems got O_DIRECT before Linux, but practically at
the same time. There is no standard and no agreement on the small
details, such as what to do when someone tries to read the last
half-sector of a file, given that direct I/O is supposed to be in whole
sectors.

It seems obvious to me that whatever ordering guarantees the user gets
without the O_DIRECT flag, he should get with it as well. The user
doesn't see disk queues and sectors and elevators and file caches. He
sees byte-stream files. If we can give much better performance by
withdrawing those guarantees for O_DIRECT, though, users would probably
accept it since there is in fact no strong legacy we have to live up to.

When I worry about I/O scheduling, I usually worry a lot more about block
device direct I/O (raw devices) than file direct I/O. The former is where
the block layer is most exposed to the user.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
@ 2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-25 0:11 ` Bryan Henderson
0 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 17:46 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel

Bryan Henderson wrote:
> It seems obvious to me that whatever ordering guarantees the user gets
> without the O_DIRECT flag, he should get with it as well.

Yes, it would be nice if we could obtain such behaviour without
unacceptable performance sacrifices. It seems to me that, if we
can find an efficient way for serializing all write-write and
read-write overlaps, plus have explicit barriers for serializing
non-overlapping writes, this should yield pretty much what
everyone wants (*). Now, that "if" needs a bit of work ... :-)

(*) The only difference being that a completing read doesn't
tell you whether the elevator has already passed a barrier.
Currently, one could be lured into depending on this.

> When I worry about I/O scheduling, I usually worry a lot more about block
> device direct I/O (raw devices) than file direct I/O. The former is where
> the block layer is most exposed to the user.

I haven't looked at the direct IO code in detail, but it seems to
me that the elevator behaviour affects file-based direct IO in the
same way as the device-based one.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
@ 2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 18:50 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel

Werner Almesberger wrote:
> Bryan Henderson wrote:
> > It seems obvious to me that whatever ordering guarantees the user gets
> > without the O_DIRECT flag, he should get with it as well.
>
> Yes, it would be nice if we could obtain such behaviour without
> unacceptable performance sacrifices. It seems to me that, if we
> can find an efficient way for serializing all write-write and
> read-write overlaps, plus have explicit barriers for serializing
> non-overlapping writes, this should yield pretty much what
> everyone wants (*). Now, that "if" needs a bit of work ... :-)

Note that what filesystems and databases want is write-write *partial
dependencies*. The per-device I/O barrier is just a crude
approximation.

1. Think about this: two filesystems on different partitions of the
   same device. The writes of each filesystem are independent, yet
   the barriers will force the writes of one filesystem to come
   before later-queued writes of the other.

2. Or, two database back-ends doing direct I/O to two separate files.

It's probably not a big performance penalty, but it illustrates that
the barriers are "bigger" than they need to be. Worth taking into
account when deciding what minimal ordering everyone _really_ wants.

If you do implement overlap detection logic, then would giving
barriers an I/O range be helpful? E.g. corresponding to partitions.
(A sketch of that idea follows below.)

Here are a few more cases, which may not be quite right even now:

3. What if a journal is on a different device to its filesystem?
   Ideally, write barriers between the different device queues would
   be appropriate.

4. A journalling filesystem mounted on a loopback device. Is this
   reliable now?

5. A journalling filesystem mounted on two loopback devices -- one
   for the fs, one for the journal.

> (*) The only difference being that a completing read doesn't
> tell you whether the elevator has already passed a barrier.
> Currently, one could be lured into depending on this.

Isn't the barrier itself an I/O operation which can be waited on?
I agree something could depend on the reads at the moment.

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread
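
One way to picture the "barriers with an I/O range" suggestion: the barrier
carries the sector range it protects (a partition, say), and a request only
has to be ordered against it when the two ranges intersect, which is the same
interval-overlap test discussed earlier in the thread. A hypothetical sketch,
with made-up types and names rather than any existing block-layer interface:

/*
 * Hypothetical "ranged" barrier: only requests that touch the covered
 * sector range are constrained by it.  Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long sector_t;

struct ranged_barrier {
	sector_t start;		/* first sector covered by the barrier */
	sector_t nr_sectors;	/* length of the covered range */
};

/* does the barrier constrain a request for [rq_start, rq_start + rq_len)? */
static bool barrier_applies(const struct ranged_barrier *bar,
			    sector_t rq_start, sector_t rq_len)
{
	return rq_start < bar->start + bar->nr_sectors &&
	       bar->start < rq_start + rq_len;
}

int main(void)
{
	struct ranged_barrier bar = { .start = 0, .nr_sectors = 1000 };

	/* a write into the protected range vs. one beyond it */
	printf("%d %d\n", barrier_applies(&bar, 500, 8),
	       barrier_applies(&bar, 4000, 8));	/* prints "1 0" */
	return 0;
}

Requests falling outside the range could then be reordered across the barrier
freely, which would cover the two-partition and two-database-file cases above.
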
* Re: barriers vs. reads - O_DIRECT
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
0 siblings, 1 reply; 19+ messages in thread
From: Werner Almesberger @ 2004-06-24 20:55 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel

Jamie Lokier wrote:
> Note that what filesystems and databases want is write-write *partial
> dependencies*. The per-device I/O barrier is just a crude
> approximation.

True ;-) So what would an ideally flexible model look like ?
Partial order ? Triggers plus virtual requests ? There's also
the little issue that this should still yield an interface
that people can understand without taking a semester of
graph theory ;-)

> 3. What if a journal is on a different device to its filesystem?

"Don't do this" comes to mind :-)

> Isn't the barrier itself an I/O operation which can be waited on?
> I agree something could depend on the reads at the moment.

Making barriers waitable might be very useful, yes. That could
also be a step towards implementing those cross-device barriers.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-24 20:55 ` Werner Almesberger
@ 2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
0 siblings, 2 replies; 19+ messages in thread
From: Jamie Lokier @ 2004-06-24 22:42 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Bryan Henderson, linux-fsdevel

Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)

For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.

For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.

(When data is not journalled, the situation can be more complicated
because to avoid revealing secure data, you might require non-journal
data to be committed before allowing a journal write which increases
the file length or block mapping metadata. So then you have
non-journal writes before journal writes before other non-journal
writes. I'm not sure if ext3 or reiserfs do this).

You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can imagine
many fs updates are aggregated, into a larger transaction. That'd
aggregate into fewer barriers.

Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- and reducing the number of I/O barriers which
reach the device.

That's a simple consequence of barriers being a partial order. If you
have two parallel transactions:

   A then (barrier) B
   C then (barrier) D

It's ok to schedule those as:

   A, B then (barrier) C, D

This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE barriers
implemented as cache flushes. Therefore scheduling I/Os in this way is
a real optimisation. In that example, it reduces 6 IDE transactions
to 5.

This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.

So that begs a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but some
of them don't need to reach the device as actual commands, i.e. with
IDE that would allow some cache flush commands to be omitted.

I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to after, that is enough to do this kind of optimisation in the
I/O scheduler.

It would be nice to come up with an interface that the loopback device
can support and relay through the underlying fs.

> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)

ext3 and reiserfs both offered this from the beginning, so it's
important to someone.

The two scenarios that come to mind are journalling onto NVRAM for
fast commits, and journalling onto a faster device than the main
filesystem -- faster in part because it's linear writing.

> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.

For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes are completed. The same is true of a database.

So, waiting on barriers isn't strictly needed for any application
which knows which writes it has queued before the barrier. fsync() and
_some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in order.

That's the difference between a cache flush and an ordering command to
some I/O devices. PATA uses cache flush commands for ordering so both
barrier types are implemented the same. I'm not sure if there are
disks which allow ordering commands without immediately committing to
storage. Are there?

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread
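
A loose sketch of the "before set / after set" view of a barrier, and of the
merge described above (with the grouping as corrected in the follow-up: A and
C ahead of the combined barrier, B and D behind it). Nothing here is an
existing interface; A-D are just labels standing in for queued writes:

/*
 * A barrier as a queue entry with a "before" set and an "after" set,
 * and two such transactions merged so only one flush reaches the
 * device.  Illustrative only.
 */
#include <stdio.h>

struct xact {
	const char *before[8];	/* writes that must precede the barrier */
	const char *after[8];	/* writes that must follow it */
};

static void issue(const struct xact *t)
{
	int i;

	for (i = 0; t->before[i]; i++)
		printf("write %s\n", t->before[i]);
	printf("flush / barrier\n");
	for (i = 0; t->after[i]; i++)
		printf("write %s\n", t->after[i]);
}

int main(void)
{
	/*
	 * "A then (barrier) B" merged with "C then (barrier) D": every
	 * before-write still precedes every after-write, so the
	 * journalling guarantee holds, but only one flush is issued.
	 * The price is the extra constraints (A before D, C before B)
	 * that the two separate barriers never asked for.
	 */
	struct xact merged = {
		.before = { "A", "C", NULL },
		.after  = { "B", "D", NULL },
	};

	issue(&merged);		/* 4 writes and 1 flush instead of 2 */
	return 0;
}
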
* Re: barriers vs. reads - O_DIRECT
2004-06-24 22:42 ` Jamie Lokier
@ 2004-06-25 3:21 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 3:21 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Bryan Henderson, linux-fsdevel

Jamie Lokier wrote:
> For that purpose, a barrier has a set of writes which must come before
> it, and a set of writes which must come after. These represent a
> transaction set.

Okay, partial order then. AFAIK, this doesn't allow us to express
things like "do A after one of { B, C } is done". That could be
useful for redundant storage structures, but we may not care.

>    A then (barrier) B
>    C then (barrier) D
>
> It's ok to schedule those as:
>
>    A, B then (barrier) C, D

I think you mean "A, C then (barrier) B, D" :-)

There's a problem, though: the first has the following relations:
A < B, C < D. The second has: A < B, A < D, C < B, C < D.

So now the elevator needs to decide if avoiding the cost of the
side-effects of a barrier (cache-flush, or such) is acceptable,
given the cost of the additional ordering restrictions. (I'm
carefully trying to avoid saying "has less cost", because the
elevator may not always pick the "cheapest" variant.)

> It would be nice to come up with an interface that the loopback device
> can support and relay through the underlying fs.

You really like those loopback devices, don't you ? :-)

> ext3 and reiserfs both offered this from the beginning, so it's
> important to someone.

Sigh, yes ...

> committed prerequisite writes to stable storage. At other times, a
> barrier is there only to preserve ordering so that a journal
> functions, but it's not required that the data is actually committed
> to storage immediately -- merely that it _will_ be committed in order.

Well, to implement cross-device barriers, you need a means to find
out if the prerequisites for the local device to cross a barrier
are satisfied, and to make the local device wait until then. This
may be implemented in a completely different thread/whatever than
the actual IO.

Making an elevator non-work-conserving is an interesting exercise
by itself. (It may be feasible: I've done it for a power management
experiment (*), and this works at least for PATA.)

(*) To stop, I simply make next_req return NULL. queue_empty still
    returns the correct result. To resume, I have an external
    trigger that calls blk_start_queue.

There's of course the question whether the elevator is really the
right place for all this. Perhaps just embedding a callback in a
barrier request could be used for a more general solution - in
particular one that would allow us to avoid having to solve all
problems at once :-)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
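
A minimal sketch of the "callback in a barrier request" idea, with
hypothetical names; nothing like this exists in the block layer today:

/*
 * Barrier request carrying a callback, so the elevator can ask whether
 * the barrier's external prerequisites (e.g. a matching barrier on
 * another device) are already satisfied.  Illustrative only.
 */
#include <stdbool.h>

struct barrier_req {
	bool (*ready)(void *ctx);	/* may the barrier be crossed yet? */
	void *ctx;			/* owner-private data */
};

/*
 * Elevator side: if the barrier is not ready yet, return NULL from
 * next_req (the non-work-conserving trick described above) and wait
 * for an external blk_start_queue()-style kick before trying again.
 */
bool barrier_may_cross(const struct barrier_req *bar)
{
	return !bar->ready || bar->ready(bar->ctx);
}
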
* RE: barriers vs. reads - O_DIRECT 2004-06-24 22:42 ` Jamie Lokier 2004-06-25 3:21 ` Werner Almesberger @ 2004-06-25 3:57 ` Guy 2004-06-25 4:52 ` Werner Almesberger 1 sibling, 1 reply; 19+ messages in thread From: Guy @ 2004-06-25 3:57 UTC (permalink / raw) To: 'Jamie Lokier', 'Werner Almesberger' Cc: 'Bryan Henderson', linux-fsdevel What if a filesystem is on a software RAID5 array? How does the filesystem use a barrier when some of the writes are on different disks? Would md pass the barrier down to each disk? -----Original Message----- From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Jamie Lokier Sent: Thursday, June 24, 2004 6:43 PM To: Werner Almesberger Cc: Bryan Henderson; linux-fsdevel@vger.kernel.org Subject: Re: barriers vs. reads - O_DIRECT Werner Almesberger wrote: > > Note that what filesystems and databases want is write-write *partial > > dependencies*. The per-device I/O barrier is just a crude > > approximation. > > True ;-) So what would an ideally flexible model look like ? > Partial order ? Triggers plus virtual requests ? There's also > the little issue that this should still yield an interface > that people can understand without taking a semester of > graph theory ;-) For a fully journalling fs (including data), a barrier is used to commit the journal writes before the corresponding non-journal writes. For that purpose, a barrier has a set of writes which must come before it, and a set of writes which must come after. These represent a transaction set. (When data is not journalled, the situation can be more complicated because to avoid revealing secure data, you might require non-journal data to be committed before allowing a journal write which increases the file length or block mapping metadata. So then you have non-journal writes before journal writes before other non-journal writes. I'm not sure if ext3 or reiserfs do this). You can imagine that every small fs update could become a small transaction. That'd be one barrier per transaction. Or you can imagine many fs updates are aggregated, into a larger transaction. That'd aggregate into fewer barriers. Now you see that if the former, many small transactions, are in the I/O queue, they _may_ be logically rescheduled by converting them to larger transactions -- and reducing the number of I/O barriers which read the device. That's a simple consequence of barriers being a partial order. If you have two parallel transactions: A then (barrier) B C then (barrier) D It's ok to schedule those as: A, B then (barrier) C, D This is interesting because barriers are _sometimes_ low-level device operations themselves, with a real overhead. Think of the IDE barriers implemented as cache flushes. Therefore scheduling I/Os in this way is a real optimisation. In that example, it reduces 6 IDE transactions to 5. This optimisation is possible even when you have totally independent filesystems, on different partitions. Therefore it _can't_ be done fully at the filesystem level, by virtue of the fs batching transactions. So that begs a question: should the filesystem give the I/O queues enough information that the I/O queues can decide when to discard physical write barriers in this way? That is, the barriers remain in the queue to logically constrain the order of other requests, but some of them don't need to reach the device as actual commands, i.e. with IDE that would allow some cache flush commands to be omitted. 
I suspect that if barriers are represented as a queue entry with a "before" set and an "after" set, such that the before set is known prior to the barrier entering the queue, and the after set may be added to after, that is enough to do this kind of optimisation in the I/O scheduler. It would be nice to come up with a interface that the loopback device can support and relay through the underlying fs. > > 3. What if a journal is on a different device to its filesystem? > > "Don't do this" comes to mind :-) ext3 and reiserfs both offered this from the begining, so it's important to someone. The two scenarios that come to mind are journalling onto NVRAM for fast commits, and journalling onto a faster device than the main filesystem -- faster in part because it's linear writing. > > Isn't the barrier itself an I/O operation which can be waited on? > > I agree something could depend on the reads at the moment. > > Making barriers waitable might be very useful, yes. That could > also be a step towards implementing those cross-device barriers. For fsync(), journalling fs's don't need to wait on barriers because they can simply return from fsync() when all the prerequisite journal writes are completed. The same is true of a database. So, waiting on barriers isn't strictly needed for any application which knows which writes it has queued before the barrier. fsync() and _some_ databases need those barriers to say they've committed prerequisite writes to stable storage. At other times, a barrier is there only to preserve ordering so that a journal functions, but it's not required that the data is actually committed to storage immediately -- merely that it _will_ be committed in order. That's the difference between a cache flush and an ordering command to some I/O devices. PATA uses cache flush commands for ordering so both barrier types are implemented the same. I'm not sure if there are disks which allow ordering commands without immediately committing to storage. Are there? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 3:57 ` Guy
@ 2004-06-25 4:52 ` Werner Almesberger
0 siblings, 0 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 4:52 UTC (permalink / raw)
To: Guy; +Cc: 'Jamie Lokier', 'Bryan Henderson', linux-fsdevel

Guy wrote:
> What if a filesystem is on a software RAID5 array? How does the filesystem
> use a barrier when some of the writes are on different disks? Would md pass
> the barrier down to each disk?

Let's assume we just have a FIFO elevator per device, barriers
affect all requests, and we don't allow any reordering across
barriers. Then, a device would have to stop IO when it hits a
barrier, and could only resume when all other devices using this
barrier have reached it.

md would have to send the barrier to each device, with the
exception that in any sequence of barriers without other requests,
only the first and the last barrier are needed.

Now, if we have a smarter elevator and smarter barriers, there are
more tricks that can be played. E.g. an elevator could try to reach
a barrier shared by many others as soon as possible, and then
schedule requests allowed to cross the barrier while waiting for
the other devices.

I'm not sure how sophisticated we want to get in the multiple
devices case, though.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
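
A sketch of the bookkeeping the simple FIFO variant implies: md clones the
barrier to every member device, each member queue stops when it reaches its
copy, and the last one to arrive wakes the others. Hypothetical structure;
locking and the md internals are ignored:

/*
 * Barrier shared by the members of an md array.  A spinlock around the
 * counter is omitted for brevity.  Illustrative only.
 */
struct shared_barrier {
	int devs_remaining;	/* member devices that have not reached it */
};

/*
 * Called from a member device's queue when it reaches its copy of the
 * barrier.  Returns nonzero for the caller that completes the
 * rendezvous, which would then wake all member queues so they can
 * resume dispatching the requests queued behind the barrier.
 */
int barrier_dev_reached(struct shared_barrier *sb)
{
	return --sb->devs_remaining == 0;
}
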
* Re: barriers vs. reads - O_DIRECT
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
@ 2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
1 sibling, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 0:11 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

>> It seems obvious to me that whatever ordering guarantees the user gets
>> without the O_DIRECT flag, he should get with it as well.
>
>Yes, it would be nice if we could obtain such behaviour without
>unacceptable performance sacrifices. It seems to me that, if we
>can find an efficient way for serializing all write-write and
>read-write overlaps, plus have explicit barriers for serializing
>non-overlapping writes, this should yield pretty much what
>everyone wants

Maybe it's time to identify a specific scenario we're trying to make
right. Because if the goal is for reads and writes with O_DIRECT to
interact the same way as without, then I don't see where the block layer
even comes into it -- the filesystem layer does the job.

There are two cases: atomic writes and nonatomic writes. Some filesystems
do atomic writes, which is what everyone seems to expect but isn't
actually stated in POSIX, and others don't. I heard Linux filesystems
based on generic_file_read/write don't, but I haven't looked closely at
those.

_With_ atomic writes, the filesystem code must use a lock to ensure a
whole write() finishes before another write() or read() begins. (If it's
smart, it can do this only where there are overlaps of file offsets, but
I don't think that's common). In the case of cached I/O, "finished" means
all the data is in the cache; with direct I/O, it means it's all on the
disk.

_Without_ atomic writes, the caller of read() and write() is responsible
for synchronization. There simply isn't any guarantee that if Process A
does a read while Process B is doing a write that A won't see just part
of B's write. So if you're brave enough to have multiple processes
simultaneously reading and writing the same file regions, the processes
have to talk to each other.

There's also (kernel) aio to consider. I can see barriers playing a role
there. But I can't see how barriers figure into ordinary O_DIRECT file
I/O.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 0:11 ` Bryan Henderson
@ 2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson
0 siblings, 2 replies; 19+ messages in thread
From: Werner Almesberger @ 2004-06-25 2:42 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel

Bryan Henderson wrote:
> There's also (kernel) aio to consider.

I suppose this would be the main scenario. AIO+O_DIRECT also
gives the closest approximation to directly talking to the
elevator.

> But I can't see how barriers figure into ordinary O_DIRECT file I/O.

Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
For reasonable semantics, I guess it would either have to block
until the operation has completed, or be "atomic" (*).

(*) Defining atomic can be tricky, too. E.g. see
    http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.0/1361.html

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT aio
2004-06-25 2:42 ` Werner Almesberger
@ 2004-06-25 15:59 ` Bryan Henderson
0 siblings, 0 replies; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 15:59 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

>> There's also (kernel) aio to consider.
>
>I suppose this would be the main scenario. AIO+O_DIRECT also
>gives the closest approximation to directly talking to the
>elevator.

I agree. And it would be good for the aio interface to have all the
function of a block device make_request_fn function, including whatever
barrier function it winds up with.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: barriers vs. reads - O_DIRECT
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
@ 2004-06-25 16:31 ` Bryan Henderson
1 sibling, 0 replies; 19+ messages in thread
From: Bryan Henderson @ 2004-06-25 16:31 UTC (permalink / raw)
To: Werner Almesberger; +Cc: linux-fsdevel

>> But I can't see how barriers figure into ordinary O_DIRECT file I/O.
>
>Hmm, I've never looked at non-AIO write semantics with O_DIRECT.
>For reasonable semantics, I guess it would either have to block
>until the operation has completed, or be "atomic"

I think this part is well established. When an O_DIRECT write() returns,
the data has to have been written to the disk, ergo the process has to
block until all relevant I/O has completed. Remember that a primary
purpose of direct I/O is to allow multiple systems to access the same
disk from different ports. It has to be possible for a system to send a
message to another system saying, "I just wrote my data on disk Q. Have
a look."

_In addition_ to that requirement, the filesystem may or may not make
that write() atomic as viewed by other processes. (And as you point out,
there are multiple kinds of atomicity it might choose). I don't believe
it would make use of a block layer barrier in either case.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2004-06-25 16:31 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-06-24 0:48 barriers vs. reads Werner Almesberger
2004-06-24 3:39 ` Werner Almesberger
2004-06-24 8:00 ` Herbert Poetzl
2004-06-24 12:16 ` Werner Almesberger
2004-06-24 13:36 ` Jamie Lokier
2004-06-24 17:02 ` Werner Almesberger
2004-06-24 16:39 ` Steve Lord
2004-06-24 17:00 ` barriers vs. reads - O_DIRECT Bryan Henderson
2004-06-24 17:46 ` Werner Almesberger
2004-06-24 18:50 ` Jamie Lokier
2004-06-24 20:55 ` Werner Almesberger
2004-06-24 22:42 ` Jamie Lokier
2004-06-25 3:21 ` Werner Almesberger
2004-06-25 3:57 ` Guy
2004-06-25 4:52 ` Werner Almesberger
2004-06-25 0:11 ` Bryan Henderson
2004-06-25 2:42 ` Werner Almesberger
2004-06-25 15:59 ` barriers vs. reads - O_DIRECT aio Bryan Henderson
2004-06-25 16:31 ` barriers vs. reads - O_DIRECT Bryan Henderson