From: Jamie Lokier
Subject: Re: barriers vs. reads - O_DIRECT
Date: Thu, 24 Jun 2004 23:42:59 +0100
To: Werner Almesberger
Cc: Bryan Henderson, linux-fsdevel@vger.kernel.org
Message-ID: <20040624224259.GA12840@mail.shareable.org>
In-Reply-To: <20040624175516.W1325@almesberger.net>

Werner Almesberger wrote:
> > Note that what filesystems and databases want is write-write *partial
> > dependencies*. The per-device I/O barrier is just a crude
> > approximation.
>
> True ;-) So what would an ideally flexible model look like ?
> Partial order ? Triggers plus virtual requests ? There's also
> the little issue that this should still yield an interface
> that people can understand without taking a semester of
> graph theory ;-)

For a fully journalling fs (including data), a barrier is used to
commit the journal writes before the corresponding non-journal writes.
For that purpose, a barrier has a set of writes which must come before
it, and a set of writes which must come after. These represent a
transaction set.

(When data is not journalled, the situation can be more complicated:
to avoid revealing stale data, you might require non-journal data to
be committed before allowing a journal write which increases the file
length or block-mapping metadata. So then you have non-journal writes
before journal writes before other non-journal writes. I'm not sure
if ext3 or reiserfs do this.)

You can imagine that every small fs update could become a small
transaction. That'd be one barrier per transaction. Or you can
imagine many fs updates aggregated into a larger transaction. That'd
aggregate into fewer barriers.

Now you see that if the former, many small transactions, are in the
I/O queue, they _may_ be logically rescheduled by converting them to
larger transactions -- reducing the number of I/O barriers which
reach the device. That's a simple consequence of barriers being a
partial order. If you have two parallel transactions:

        A then (barrier) B
        C then (barrier) D

it's ok to schedule those as:

        A, B then (barrier) C, D

This is interesting because barriers are _sometimes_ low-level device
operations themselves, with a real overhead. Think of the IDE
barriers implemented as cache flushes. Therefore scheduling I/Os in
this way is a real optimisation. In that example, it reduces 6 IDE
operations to 5.

This optimisation is possible even when you have totally independent
filesystems, on different partitions. Therefore it _can't_ be done
fully at the filesystem level, by virtue of the fs batching
transactions.

So that raises a question: should the filesystem give the I/O queues
enough information that the I/O queues can decide when to discard
physical write barriers in this way? That is, the barriers remain in
the queue to logically constrain the order of other requests, but
some of them don't need to reach the device as actual commands; with
IDE that would allow some cache flush commands to be omitted.
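To make that concrete, here's a rough sketch in C of what I have in
mind (all names invented; this is nothing like the real bio/request
code, just an illustration):

        #include <stdio.h>

        #define MAX_WRITES 8

        /* A barrier as a queue entry carrying explicit write sets:
         * the ids of writes which must complete before it, and of
         * writes which must wait for it. */
        struct barrier {
                int before[MAX_WRITES];
                int nbefore;
                int after[MAX_WRITES];
                int nafter;
        };

        /* Merge b2 into b1.  Safe because a barrier is only a partial
         * order: "A before B" and "C before D" are both still
         * satisfied by "A,C before B,D" -- and one physical flush now
         * serves for two. */
        static void merge_barriers(struct barrier *b1,
                                   const struct barrier *b2)
        {
                int i;

                for (i = 0; i < b2->nbefore; i++)
                        b1->before[b1->nbefore++] = b2->before[i];
                for (i = 0; i < b2->nafter; i++)
                        b1->after[b1->nafter++] = b2->after[i];
        }

        int main(void)
        {
                /* A (id 1) then barrier then B (id 2);
                 * C (id 3) then barrier then D (id 4). */
                struct barrier t1 = { {1}, 1, {2}, 1 };
                struct barrier t2 = { {3}, 1, {4}, 1 };

                merge_barriers(&t1, &t2);  /* 6 device ops become 5 */

                printf("merged: %d before, %d after\n",
                       t1.nbefore, t1.nafter);
                return 0;
        }

The I/O scheduler could do this whenever two barriers end up adjacent
in the queue with no intervening ordering constraint.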
I suspect that if barriers are represented as a queue entry with a
"before" set and an "after" set, such that the before set is known
prior to the barrier entering the queue, and the after set may be
added to afterwards, that is enough to do this kind of optimisation
in the I/O scheduler.

It would be nice to come up with an interface that the loopback
device can support and relay through to the underlying fs.

> > 3. What if a journal is on a different device to its filesystem?
>
> "Don't do this" comes to mind :-)

ext3 and reiserfs have both offered this from the beginning, so it's
important to someone. The two scenarios that come to mind are
journalling onto NVRAM for fast commits, and journalling onto a
faster device than the main filesystem -- faster in part because it's
linear writing.

> > Isn't the barrier itself an I/O operation which can be waited on?
> > I agree something could depend on the reads at the moment.
>
> Making barriers waitable might be very useful, yes. That could
> also be a step towards implementing those cross-device barriers.

For fsync(), journalling fs's don't need to wait on barriers because
they can simply return from fsync() when all the prerequisite journal
writes have completed. The same is true of a database. So waiting on
barriers isn't strictly needed by any application which knows which
writes it has queued before the barrier.

fsync() and _some_ databases need those barriers to say they've
committed prerequisite writes to stable storage. At other times, a
barrier is there only to preserve ordering so that a journal
functions, but it's not required that the data is actually committed
to storage immediately -- merely that it _will_ be committed in
order.

That's the difference between a cache flush and an ordering command
to some I/O devices. PATA uses cache flush commands for ordering, so
both barrier types are implemented the same. I'm not sure if there
are disks which allow ordering commands without immediately
committing to storage. Are there?
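If they exist, the block layer could pick the cheaper command per
barrier. A sketch of the distinction (again with invented names, not
a real driver interface):

        /* A barrier tagged with what it actually guarantees. */
        enum barrier_kind {
                BARRIER_FLUSH,    /* on stable storage before return */
                BARRIER_ORDERED,  /* later writes must not pass it;
                                   * commit to media may happen later */
        };

        struct device_ops {
                int has_ordering_cmd;       /* e.g. SCSI ordered tags */
                void (*cache_flush)(void);  /* e.g. IDE FLUSH CACHE */
                void (*ordered_cmd)(void);  /* order-only cmd, if any */
        };

        void issue_barrier(const struct device_ops *dev,
                           enum barrier_kind kind)
        {
                if (kind == BARRIER_ORDERED && dev->has_ordering_cmd)
                        dev->ordered_cmd(); /* no forced media write */
                else
                        dev->cache_flush(); /* PATA: flush covers both */
        }

-- 
Jamie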