linux-fsdevel.vger.kernel.org archive mirror
* O_DIRECT and barriers
       [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
@ 2009-08-20 22:12 ` Christoph Hellwig
  2009-08-21 11:40   ` Jens Axboe
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-20 22:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-scsi

Btw, something semi-related I've been looking at recently:

Currently O_DIRECT writes bypass all kernel caches, but they do
use the disk caches.  We currently don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments.  I've started thinking about how to implement this.

The simplest scheme would be to mark the last request of each
O_DIRECT write as barrier requests.  This works nicely from the FS
perspective and works with all hardware supporting barriers.  It's
massive overkill though - we really only need to flush the cache
after our request, and not before.  And for SCSI we would be much
better just setting the FUA bit on the commands and not require a
full cache flush at all.

The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workqueue.
This only requires one full cache flush, but it's still suboptimal.

I have prototyped this for XFS, but I don't really like it.

The best scheme would be to get some high-level FUA request in the
block layer which gets emulated by a post-command cache flush.
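
As a rough sketch of the cache-flush building block that both the
flush-after-write scheme and the FUA emulation would rely on (using the
block layer helper as it looks around 2.6.31; the wrapper function and
its name are made up for illustration, only blkdev_issue_flush() is the
real API):

    #include <linux/blkdev.h>

    /*
     * Flush the device write cache once the last bio of an O_DIRECT
     * write has completed.  blkdev_issue_flush() blocks until the
     * flush command finishes, so this would have to run from process
     * context, e.g. a workqueue kicked off by the end_io callback -
     * which is exactly the overhead complained about above.
     */
    static int dio_flush_disk_cache(struct block_device *bdev)
    {
        if (!bdev)
            return 0;
        /* second argument is an optional error sector pointer */
        return blkdev_issue_flush(bdev, NULL);
    }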

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
@ 2009-08-21 11:40   ` Jens Axboe
  2009-08-21 13:54     ` Jamie Lokier
  2009-08-21 14:20     ` Christoph Hellwig
  0 siblings, 2 replies; 50+ messages in thread
From: Jens Axboe @ 2009-08-21 11:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-scsi

On Thu, Aug 20 2009, Christoph Hellwig wrote:
> Btw, something semi-related I've been looking at recently:
> 
> Currently O_DIRECT writes bypass all kernel caches, but they do
> use the disk caches.  We currently don't have any barrier support for
> them at all, which is really bad for data integrity in virtualized
> environments.  I've started thinking about how to implement this.
> 
> The simplest scheme would be to mark the last request of each
> O_DIRECT write as barrier requests.  This works nicely from the FS
> perspective and works with all hardware supporting barriers.  It's
> massive overkill though - we really only need to flush the cache
> after our request, and not before.  And for SCSI we would be much
> better just setting the FUA bit on the commands and not require a
> full cache flush at all.
> 
> The next scheme would be to simply always do a cache flush after
> the direct I/O write has completed, but given that blkdev_issue_flush
> blocks until the command is done that would a) require everyone to
> use the end_io callback and b) spend a lot of time in that workqueue.
> This only requires one full cache flush, but it's still suboptimal.
> 
> I have prototyped this for XFS, but I don't really like it.
> 
> The best scheme would be to get some high-level FUA request in the
> block layer which gets emulated by a post-command cache flush.

I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
request is O_DIRECT and a WRITE, ala:

        if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
                WRITE_FUA;

I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 11:40   ` Jens Axboe
@ 2009-08-21 13:54     ` Jamie Lokier
  2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 14:20     ` Christoph Hellwig
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-21 13:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-scsi

Jens Axboe wrote:
> On Thu, Aug 20 2009, Christoph Hellwig wrote:
> > Btw, something semi-related I've been looking at recently:
> > 
> > Currently O_DIRECT writes bypass all kernel caches, but they do
> > use the disk caches.  We currently don't have any barrier support for
> > them at all, which is really bad for data integrity in virtualized
> > environments.  I've started thinking about how to implement this.
> > 
> > The simplest scheme would be to mark the last request of each
> > O_DIRECT write as barrier requests.  This works nicely from the FS
> > perspective and works with all hardware supporting barriers.  It's
> > massive overkill though - we really only need to flush the cache
> > after our request, and not before.  And for SCSI we would be much
> > better just setting the FUA bit on the commands and not require a
> > full cache flush at all.
> > 
> > The next scheme would be to simply always do a cache flush after
> > the direct I/O write has completed, but given that blkdev_issue_flush
> > blocks until the command is done that would a) require everyone to
> > use the end_io callback and b) spend a lot of time in that workqueue.
> > This only requires one full cache flush, but it's still suboptimal.
> > 
> > I have prototyped this for XFS, but I don't really like it.
> > 
> > The best scheme would be to get some high-level FUA request in the
> > block layer which gets emulated by a post-command cache flush.
> 
> I've talked to Chris about this in the past too, but I never got around
> to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> without making too many changes, and we do have FUA support on most SATA
> drives too. Basically just a check in the driver for whether the
> request is O_DIRECT and a WRITE, ala:
> 
>         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
>                 WRITE_FUA;
> 
> I know that FUA is used by that other OS, so I think we should be golden
> on the hw support side.

I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong.  (It's also
too weak, on drives which don't have FUA).

I would like to be able to get the same performance and integrity as
the kernel filesystems can get, and that means using barrier flushes
when a kernel filesystem would use them, and FUA when a kernel
filesystem would use that.  Preferably the same whether userspace is
using a file or a block device.

The conclusion I came to is that O_DIRECT users need a barrier flush
primitive.  FUA can either be deduced by the elevator, or signalled
explicitly by userspace.

Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.

Those apply to files, but they could be made to have the same effect
with block devices, which would be nice for applications which can use
both.  I'll talk about files from here on; assume the idea is to
provide the same functions for block devices.

It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.  O_DSYNC or fdatasync ensures that unknown buffered
writes will be committed.  This is true for other operating systems
too, for the same reason, except some other unixes will convert all
writes to buffered writes, not just corner cases, under various
circumstances that it's hard for applications to detect.

So there's already a good match to using fdatasync and/or O_DSYNC for
O_DIRECT integrity.

If we define fdatasync's behaviour to be that it always causes a
barrier flush if there have been any WRITE commands to a disk since
the last barrier flush, in addition to its behaviour of flushing
cached pages, that would be enough for VM and database applications
to have good support for integrity.  Of course O_DSYNC would imply
the same after each write.
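
A minimal userspace sketch of what that would look like for an
application (the block size, alignment and open flags are just
assumptions, and it presumes fdatasync() gains the barrier-flush
behaviour described above):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* one O_DIRECT write followed by a "barrier" */
    int write_then_barrier(int fd)
    {
        void *buf;

        /* O_DIRECT needs aligned buffers; 4096 covers most devices */
        if (posix_memalign(&buf, 4096, 4096))
            return -1;
        memset(buf, 0, 4096);

        if (pwrite(fd, buf, 4096, 0) != 4096) {
            free(buf);
            return -1;
        }
        free(buf);

        /* flushes any buffered-fallback pages and, under the proposed
         * semantics, the disk write cache as well */
        return fdatasync(fd);
    }

The fd would be opened along the lines of open(path, O_RDWR | O_DIRECT);
opening with O_DIRECT|O_DSYNC instead, and dropping the fdatasync(),
corresponds to the FUA-per-write case.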

As an optimisation, I think that FUA might be best done by the
elevator detecting opportunities to do that, rather than explicitly
signalled.

For VMs, the highest performance (with integrity) will likely come from:

    If the guest requests a virtual disk with write cache enabled:

        - Host opens file/blockdev with O_DIRECT  (but *not O_DSYNC*)
        - Host maps the guest's WRITE commands to host writes
        - Host maps the guest's CACHE FLUSH commands to fdatasync on host

    If the guest requests a virtual disk with write cache disabled:

        - Host opens file/blockdev with O_DIRECT|O_DSYNC
        - Host maps the guest's WRITE commands to host writes
        - Host maps the guest's CACHE FLUSH commands to nothing

    That's with host configured to use O_DIRECT.  If the host is
    configured to not use O_DIRECT, the same logic applies except that
    O_DIRECT is simply omitted.  Nice and simple eh?
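
A compact sketch of that host-side mapping (the request names and
helper functions are made up for illustration, not taken from any real
VMM):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* hypothetical guest request types */
    enum guest_req { GUEST_WRITE, GUEST_CACHE_FLUSH };

    /* O_DIRECT always; O_DSYNC only when the guest's virtual disk has
     * its write cache disabled */
    static int open_backing(const char *path, int guest_cache_enabled)
    {
        int flags = O_RDWR | O_DIRECT;

        if (!guest_cache_enabled)
            flags |= O_DSYNC;
        return open(path, flags);
    }

    static int handle_guest_req(int fd, int opened_dsync,
                                enum guest_req req,
                                const void *buf, size_t len, off_t off)
    {
        switch (req) {
        case GUEST_WRITE:
            return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
        case GUEST_CACHE_FLUSH:
            /* maps to nothing when every write is already O_DSYNC,
             * otherwise to a host barrier flush */
            return opened_dsync ? 0 : fdatasync(fd);
        }
        return -1;
    }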

Databases and userspace filesystems would be encouraged to do the
equivalent.  In other words, databases would open with O_DIRECT or not
(depending on behaviour preferred), and use fdatasync for barriers, or
use O_DSYNC if they are not using fdatasync.
       
Notice how it conveniently does the right thing when the kernel falls
back to buffered writes without telling anyone.

Code written in that way should do the right thing (or as close as
it's possible to get) on other OSes too.

(Btw, from what I can tell from various Windows documentation, it maps
the equivalent of O_DIRECT|O_DSYNC to setting FUA on every disk write,
and it maps the equivalent of fsync to sending the disk a cache
flush command as well as writing file metadata.  There's no Windows
equivalent to O_SYNC or fdatasync.)

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 11:40   ` Jens Axboe
  2009-08-21 13:54     ` Jamie Lokier
@ 2009-08-21 14:20     ` Christoph Hellwig
  2009-08-21 15:06       ` James Bottomley
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-21 14:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 01:40:10PM +0200, Jens Axboe wrote:
> I've talked to Chris about this in the past too, but I never got around
> to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> without making too many changes, and we do have FUA support on most SATA
> drives too. Basically just a check in the driver for whether the
> request is O_DIRECT and a WRITE, ala:
> 
>         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
>                 WRITE_FUA;
> 
> I know that FUA is used by that other OS, so I think we should be golden
> on the hw support side.

Just doing FUA should be pretty easy, in fact from my reading of the
code we already use FUA for barriers if supported, that is only drain
the queue, do a pre-flush for a barrier and then issue the actual
barrier write as FUA.

I can play around with getting rid of the pre-flush and doing cache
flush based emulation if FUA is not supported if you're fine with that.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 13:54     ` Jamie Lokier
@ 2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 15:24         ` Jamie Lokier
  2009-08-21 22:08         ` Theodore Tso
  0 siblings, 2 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-21 14:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Jens Axboe, Christoph Hellwig, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 02:54:03PM +0100, Jamie Lokier wrote:
> I've been thinking about this too, and for optimal performance with
> VMs and also with databases, I think FUA is too strong.  (It's also
> too weak, on drives which don't have FUA).


Why is FUA too strong?

> Fortunately there's already a sensible API for both: fdatasync (and
> aio_fsync) to mean flush, and O_DSYNC (or inferred from
> flush-after-one-write) to mean FUA.

I thought about this a lot.  It would be sensible to only require
the FUA semantics if O_SYNC is specified.  But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write.  That is, until those pesky disk
write-back caches come into play, which no application writer wants or
should have to understand.

> It turns out that applications needing integrity must use fdatasync or
> O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> choose to use buffered writes at any time, with no signal to the
> application.

The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well.  Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:20     ` Christoph Hellwig
@ 2009-08-21 15:06       ` James Bottomley
  2009-08-21 15:23         ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: James Bottomley @ 2009-08-21 15:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

On Fri, 2009-08-21 at 10:20 -0400, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 01:40:10PM +0200, Jens Axboe wrote:
> > I've talked to Chris about this in the past too, but I never got around
> > to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> > without making too many changes, and we do have FUA support on most SATA
> > drives too. Basically just a check in the driver for whether the
> > request is O_DIRECT and a WRITE, ala:
> > 
> >         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
> >                 WRITE_FUA;
> > 
> > I know that FUA is used by that other OS, so I think we should be golden
> > on the hw support side.
> 
> Just doing FUA should be pretty easy, in fact from my reading of the
> code we already use FUA for barriers if supported, that is only drain
> the queue, do a pre-flush for a barrier and then issue the actual
> barrier write as FUA.

I've never really understood why FUA is considered equivalent to a
barrier.  Our barrier semantics are that all I/Os before the barrier
should be safely on disk after the barrier executes.  The FUA semantics
are that *this write* should be safely on disk after it executes ... it
can still leave preceding writes in the cache.  I can see that if you're
only interested in metadata, making every metadata write a FUA and
leaving the cache to sort out data writes does give FS image
consistency.

How does FUA give us linux barrier semantics?

James



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 15:06       ` James Bottomley
@ 2009-08-21 15:23         ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-21 15:23 UTC (permalink / raw)
  To: James Bottomley; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 09:06:10AM -0600, James Bottomley wrote:
> I've never really understood why FUA is considered equivalent to a
> barrier.  Our barrier semantics are that all I/Os before the barrier
> should be safely on disk after the barrier executes.  The FUA semantics
> are that *this write* should be safely on disk after it executes ... it
> can still leave preceding writes in the cache.  I can see that if you're
> only interested in metadata, making every metadata write a FUA and
> leaving the cache to sort out data writes does give FS image
> consistency.
> 
> How does FUA give us linux barrier semantics?

FUA by itself doesn't.

Think what use cases we have for barriers and/or FUA right now:

 - a cache flush.  Can only be implemented as a cache flush, obviously.
 - a barrier flush bio - can be implemented as
     o cache flush, write, cache flush
     o or more efficiently as cache flush, write with FUA bit set

now there is a third use case for O_SYNC and O_DIRECT writes, which
actually do have FUA-like semantics, that is we only guarantee the I/O
is on disk, but we do not make guarantees about ordering vs earlier
writes.  Currently we (as in those few filesystems bothering despite the
VFS/generic helpers making it really hard) implement O_SYNC by:

 - doing one or multiple normal writes, and waiting on them
 - then issuing a cache flush - either explicitly via blkdev_issue_flush
   or implicitly as part of a barrier write for metadata

this could be done more efficiently by simply setting the FUA bit on
these requests if we had an API for it.  The same should apply for
O_DIRECT, except that currently we don't even try.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:26       ` Christoph Hellwig
@ 2009-08-21 15:24         ` Jamie Lokier
  2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 22:08         ` Theodore Tso
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-21 15:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 02:54:03PM +0100, Jamie Lokier wrote:
> > I've been thinking about this too, and for optimal performance with
> > VMs and also with databases, I think FUA is too strong.  (It's also
> > too weak, on drives which don't have FUA).
> 
> Why is FUA too strong?

In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers.  Others report
similar results.  This is with disks that don't have NCQ; good NCQ may
be better.

Using FUA for all writes should be equivalent to writing with write
cache disabled.

A journalling filesystem or database tends to write like this:

   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE

When a guest does that, for integrity it can be mapped to this on the
host with FUA:

   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA

or

   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) CACHE FLUSH
   (host) WRITE
   (host) CACHE FLUSH 
   (host) WRITE
   (host) WRITE
   (host) WRITE

We know from measurements that disabling the disk write cache is much
slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.

It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from.  Integrity should not be a surprise when the
application requests it.

> > Fortunately there's already a sensible API for both: fdatasync (and
> > aio_fsync) to mean flush, and O_DSYNC (or inferred from
> > flush-after-one-write) to mean FUA.
> 
> I thought about this a lot.  It would be sensible to only require
> the FUA semantics if O_SYNC is specified.  But from looking around at
> users of O_DIRECT no one seems to actually specify O_SYNC with it.

O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).

I notice that all applications I looked at which use O_DIRECT don't
attempt to determine when O_DIRECT will definitely result in direct
writes; they simply assume it can be used as a substitute for O_SYNC
or O_DSYNC, as long as you follow the alignment rules.  Generally they
leave it to the user to configure what they want, and often don't
explain the drive integrity issue, except to say "depends on the OS,
your mileage may vary, we can do nothing about it".

Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.

On a related note,
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
says:

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable code.

> And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
> if O_DIRECT bypasses the filesystem cache there is nothing else
> left to sync for a non-extending write.

Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
Ugh, how messy.

> That is until those pesky disk
> write back caches come into play that no application writer wants or
> should have to understand.

As far as I can tell, they generally go out of their way to avoid
understanding it, except as a vaguely uncomfortable awareness and pass
the problem on to the application's user.

Unfortunately just disabling the disk cache for O_DIRECT would make
its performance drop significantly; otherwise I'd say go for it.

> > It turns out that applications needing integrity must use fdatasync or
> > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > choose to use buffered writes at any time, with no signal to the
> > application.
> 
> The fallback was a relatively recent addition to the O_DIRECT semantics
> for broken filesystems that can't handle holes very well.  Fortunately
> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> semantics for that already.

Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
O_DIRECT either? :-)

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 15:24         ` Jamie Lokier
@ 2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 19:18             ` Ric Wheeler
  2009-08-22  0:50             ` Jamie Lokier
  0 siblings, 2 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-21 17:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
> In measurements I've done, disabling a disk's write cache results in
> much slower ext3 filesystem writes than using barriers.  Others report
> similar results.  This is with disks that don't have NCQ; good NCQ may
> be better.

On a scsi disk and a SATA SSD with NCQ I get different results.  Most
workloads, in particular metadata-intensive ones and large streaming
writes, are noticeably better just turning off the write cache.  The only
ones that benefit from it are relatively small writes without O_SYNC
or many fsyncs.  This is however using XFS, which tends to issue many
more barriers than ext3.

> Using FUA for all writes should be equivalent to writing with write
> cache disabled.
> 
> A journalling filesystem or database tends to write like this:
> 
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) CACHE FLUSH
>    (guest) WRITE
>    (guest) CACHE FLUSH
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE

In the optimal case, yeah.

> Assuming that WRITE FUA is equivalent to disabling write cache, we may
> expect the WRITE FUA version to run much slower than the CACHE FLUSH
> version.

For a workload that only does FUA writes, yeah.  That is however the use
case for virtual machines.  As I'm looking into those issues I will run
some benchmarks comparing both variants.

> It's also too weak, of course, on drives which don't support FUA.
> Then you have to use CACHE FLUSH anyway, so the code should support
> that (or disable the write cache entirely, which also performs badly).
> If you don't handle drives without FUA, then you're back to "integrity
> sometimes, user must check type of hardware", which is something we're
> trying to get away from.  Integrity should not be a surprise when the
> application requests it.

As mentioned in the previous mails FUA would only be an optimization
(if it ends up helping) we do need to support the cache flush case.

> > I thought about this a lot.  It would be sensible to only require
> > the FUA semantics if O_SYNC is specified.  But from looking around at
> > users of O_DIRECT no one seems to actually specify O_SYNC with it.
> 
> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
> inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_SYNC above is the Linux O_SYNC aka Posix O_DSYNC.

> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> integrity problems when direct writes are converted to buffered writes
> - which applies to all or nearly all OSes according to their
> documentation (I've read a lot of them).

It did not happen on IRIX, where O_DIRECT originated, and neither does
it happen on Linux when using XFS.  Then again, at least on Linux we
provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC) semantics for
that case.

> Imho, integrity should not be something which depends on the user
> knowing the details of their hardware to decide application
> configuration options - at least, not out of the box.

That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect.  And the wording in our manpages also
suggests this behaviour, although it is not entirely clear:


O_DIRECT (Since Linux 2.4.10)

	Try to minimize cache effects of the I/O to and from this file.  In
	general this will degrade performance, but it is useful in special
	situations, such as when applications do their own caching.  File I/O
	is done directly to/from user space buffers.  The I/O is synchronous,
	that is,  at the completion of a read(2) or write(2), data is
	guaranteed to have been transferred.  See NOTES below for further
	discussion.

(And yeah, the whole wording is horrible, I will send an update once
we've sorted out the semantics, including caveats about older kernels)

> > And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
> > if O_DIRECT bypasses the filesystem cache there is nothing else
> > left to sync for a non-extending write.
> 
> Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
> Ugh, how messy.

Yes.  Except when using XFS and using the "osyncisosync" mount option :)

> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
> O_DIRECT either? :-)

No.  In the generic code and filesystems I looked at it simply has no
effect at all.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 17:45           ` Christoph Hellwig
@ 2009-08-21 19:18             ` Ric Wheeler
  2009-08-22  0:50             ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Ric Wheeler @ 2009-08-21 19:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, Jens Axboe, linux-fsdevel, linux-scsi

On 08/21/2009 01:45 PM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
>    
>> In measurements I've done, disabling a disk's write cache results in
>> much slower ext3 filesystem writes than using barriers.  Others report
>> similar results.  This is with disks that don't have NCQ; good NCQ may
>> be better.
>>      
> On a scsi disk and a SATA SSD with NCQ I get different results.  Most
> workloads, in particular metadata-intensive ones and large streaming
> writes, are noticeably better just turning off the write cache.  The only
> ones that benefit from it are relatively small writes without O_SYNC
> or many fsyncs.  This is however using XFS, which tends to issue many
> more barriers than ext3.
>    

With normal S-ATA disks, streaming write workloads on ext3 run twice as 
fast with barriers & write cache enabled in my testing.

Small file workloads were more even, if I remember correctly...

ric

>    
>> Using FUA for all writes should be equivalent to writing with write
>> cache disabled.
>>
>> A journalling filesystem or database tends to write like this:
>>
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>      
> In the optimal case, yeah.
>
>    
>> Assuming that WRITE FUA is equivalent to disabling write cache, we may
>> expect the WRITE FUA version to run much slower than the CACHE FLUSH
>> version.
>>      
> For a workload that only does FUA writes, yeah.  That is however the use
> case for virtual machines.  As I'm looking into those issues I will run
> some benchmarks comparing both variants.
>
>    
>> It's also too weak, of course, on drives which don't support FUA.
>> Then you have to use CACHE FLUSH anyway, so the code should support
>> that (or disable the write cache entirely, which also performs badly).
>> If you don't handle drives without FUA, then you're back to "integrity
>> sometimes, user must check type of hardware", which is something we're
>> trying to get away from.  Integrity should not be a surprise when the
>> application requests it.
>>      
> As mentioned in the previous mails FUA would only be an optimization
> (if it ends up helping) we do need to support the cache flush case.
>
>    
>>> I thought about this a lot.  It would be sensible to only require
>>> the FUA semantics if O_SYNC is specified.  But from looking around at
>>> users of O_DIRECT no one seems to actually specify O_SYNC with it.
>>>        
>> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
>> inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.
>>      
> O_SYNC above is the Linux O_SYNC aka Posix O_DSYNC.
>
>    
>> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
>> integrity problems when direct writes are converted to buffered writes
>> - which applies to all or nearly all OSes according to their
>> documentation (I've read a lot of them).
>>      
> It did not happen on IRIX, where O_DIRECT originated, and neither does
> it happen on Linux when using XFS.  Then again, at least on Linux we
> provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC) semantics for
> that case.
>
>    
>> Imho, integrity should not be something which depends on the user
>> knowing the details of their hardware to decide application
>> configuration options - at least, not out of the box.
>>      
> That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
> is not what users naively expect.  And the wording in our manpages also
> suggests this behaviour, although it is not entirely clear:
>
>
> O_DIRECT (Since Linux 2.4.10)
>
> 	Try to minimize cache effects of the I/O to and from this file.  In
> 	general this will degrade performance, but it is useful in special
> 	situations, such as when applications do their own caching.  File I/O
> 	is done directly to/from user space buffers.  The I/O is synchronous,
> 	that is,  at the completion of a read(2) or write(2), data is
> 	guaranteed to have been transferred.  See NOTES below for further
> 	discussion.
>
> (And yeah, the whole wording is horrible, I will send an update once
> we've sorted out the semantics, including caveats about older kernels)
>
>    
>>> And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
>>> if O_DIRECT bypasses the filesystem cache there is nothing else
>>> left to sync for a non-extending write.
>>>        
>> Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
>> Ugh, how messy.
>>      
> Yes.  Except when using XFS and using the "osyncisosync" mount option :)
>
>    
>>> The fallback was a relatively recent addition to the O_DIRECT semantics
>>> for broken filesystems that can't handle holes very well.  Fortunately
>>> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
>>> semantics for that already.
>>>        
>> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
>> O_DIRECT either? :-)
>>      
> No.  In the generic code and filesystems I looked at it simply has no
> effect at all.
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 15:24         ` Jamie Lokier
@ 2009-08-21 22:08         ` Theodore Tso
  2009-08-21 22:38           ` Joel Becker
                             ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Theodore Tso @ 2009-08-21 22:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > It turns out that applications needing integrity must use fdatasync or
> > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > choose to use buffered writes at any time, with no signal to the
> > application.
> 
> The fallback was a relatively recent addition to the O_DIRECT semantics
> for broken filesystems that can't handle holes very well.  Fortunately
> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> semantics for that already.

Um, actually, we don't.  If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.

This question recently came up on the ext4 developer's list, because
of a question of how direct I/O to a preallocated (uninitialized)
extent should be handled.  Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not?  One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write.  In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.  

Perhaps they were given an incorrect answer, but it's clear the
semantics of exactly how Direct I/O works in edge cases aren't well
defined, or at least clearly and widely understood.

I have an early draft (for discussion only) of what we think it means and
what is currently implemented in Linux, which I've put up (again, let
me emphasize) for *discussion* here:

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.

						- Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
@ 2009-08-21 22:38           ` Joel Becker
  2009-08-21 22:45           ` Joel Becker
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Joel Becker @ 2009-08-21 22:38 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> I have an early draft (for discussion only) of what we think it means and
> what is currently implemented in Linux, which I've put up (again, let
> me emphasize) for *discussion* here:
> 
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

	I think you mean "not well specified". ;-)

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
  2009-08-21 22:38           ` Joel Becker
@ 2009-08-21 22:45           ` Joel Becker
  2009-08-22  2:11             ` Theodore Tso
  2009-08-24  2:37             ` Christoph Hellwig
  2009-08-22  0:56           ` Jamie Lokier
  2009-08-26  6:34           ` Dave Chinner
  3 siblings, 2 replies; 50+ messages in thread
From: Joel Becker @ 2009-08-21 22:45 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> 
> Comments are welcome, either on the wiki's talk page, or directly to
> me, or to the linux-fsdevel or linux-ext4.

	In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache.  That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!

Joel

-- 

To spot the expert, pick the one who predicts the job will take the
longest and cost the most.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 19:18             ` Ric Wheeler
@ 2009-08-22  0:50             ` Jamie Lokier
  2009-08-22  2:19               ` Theodore Tso
  2009-08-24  2:34               ` Christoph Hellwig
  1 sibling, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-22  0:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> > O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> > integrity problems when direct writes are converted to buffered writes
> > - which applies to all or nearly all OSes according to their
> > documentation (I've read a lot of them).
> 
> It did not happen on IRIX, where O_DIRECT originated,

IRIX has an unusually sane O_DIRECT - at least according to its
documentation.  This is from write(2):

     When attempting to write to a file with O_DIRECT or FDIRECT set,
     the portion being written can not be locked in memory by any
     process. In this case, -1 will be returned and errno will be set
     to EBUSY.

AIX however says this:

     In order to avoid consistency issues between programs that use
     Direct I/O and programs that use normal cached I/O, Direct I/O is
     by default used in an exclusive use mode. If there are multiple
     opens of a file and some of them are direct and others are not,
     the file will stay in its normal cached access mode. Only when
     the file is open exclusively by Direct I/O programs will the file
     be placed in Direct I/O mode.

     Similarly, if the file is mapped into virtual memory via the
     shmat() or mmap() system calls, then file will stay in normal
     cached mode.

     The JFS or JFS2 will attempt to move the file into Direct I/O
     mode any time the last conflicting, non-direct access is
     eliminated (either by close(), munmap(), or shmdt()
     subroutines). Changing the file from normal mode to Direct I/O
     mode can be rather expensive since it requires writing all
     modified pages to disk and removing all the file's pages from
     memory.

> neither does it happen on Linux when using XFS.  Then again at least on
> Linux we provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC)
> semantics for that case.

As Ted Ts'o pointed out, we don't.

> > Imho, integrity should not be something which depends on the user
> > knowing the details of their hardware to decide application
> > configuration options - at least, not out of the box.
> 
> That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
> is not what users naively expect.

Oh, I agree with that.  That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC too.  That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
would then make sense as an FUA equivalent.

> And the wording in hour manpages also suggests this behaviour,
> although it is not entirely clear:
> 
> O_DIRECT (Since Linux 2.4.10)
> 	Try to minimize cache effects of the I/O to and from this file.  In
> 	general this will degrade performance, but it is useful in special
> 	situations, such as when applications do their own caching.  File I/O
> 	is done directly to/from user space buffers.  The I/O is synchronous,
> 	that is,  at the completion of a read(2) or write(2), data is
> 	guaranteed to have been transferred.  See NOTES below for further
> 	discussion.

Perhaps in the same way that fsync/fdatasync aren't clear on disk
cache behaviour either.  On Linux and some other OSes.

> (And yeah, the whole wording is horrible, I will send an update once
> we've sorted out the semantics, including caveats about older kernels)

One thing it's unhelpful about is the performance.  O_DIRECT tends to
improve performance for applications that do their own caching; it
also improves whole-system performance when caching would cause
memory pressure; and on Linux O_DIRECT is necessary for AIO, which
can improve performance.

I have a 166MHz embedded device that I'm using O_DIRECT on to improve
performance - from 1MB/s to 10MB/s.

However, if O_DIRECT is changed to force each write(2) through the disk
cache separately, then it will no longer provide this performance
boost, at least with some kinds of disk.

That's why it's important not to change it casually.  Maybe it's the
right thing to do, but then it will be important to provide another
form of O_DIRECT which does not write through the disk cache, instead
providing a barrier capability.

(...After all, if we believed in integrity above everything then barriers
would be enabled for ext3 by default, *ahem*.)

Probably the best thing to do is look at what other OSes that are used
by databases etc. do with O_DIRECT, and if it makes sense, copy it.

What does IRIX do?  Does O_DIRECT on IRIX write through the drive's
cache?  What about Solaris?

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
  2009-08-21 22:38           ` Joel Becker
  2009-08-21 22:45           ` Joel Becker
@ 2009-08-22  0:56           ` Jamie Lokier
  2009-08-22  2:06             ` Theodore Tso
  2009-08-26  6:34           ` Dave Chinner
  3 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-22  0:56 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jens Axboe, linux-fsdevel,
	linux-scsi

Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> > 
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.
> 
> This question recently came up on the ext4 developer's list, because
> of a question of how direct I/O to a preallocated (uninitialized)
> extent should be handled.  Are we supposed to guarantee synchronous
> updates of the metadata by the time write(2) returns, or not?  One of
> the ext4 developers (I can't remember if it was Mingming or Eric)
> asked an XFS developer what they did in that case, and I believe the
> answer they were given was that XFS started a commit, but did *not*
> wait for the commit to complete before returning from the Direct I/O
> write.  In fact, they were told (I believe this was from an SGI
> engineer, but I don't remember the name; we can track that down if
> it's important) that if an application wanted to guarantee metadata
> would be updated for an extending write, they had to use fsync() or
> O_SYNC/O_DSYNC.  
> 
> Perhaps they were given an incorrect answer, but it's clear the
> semantics of exactly how Direct I/O works in edge cases isn't well
> defined, or at least clearly and widely understood.

And that's not even a hardware cache issue, just whether filesystem
metadata is written.

AIX behaves like XFS according to documentation:

    [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable
code.

> I have an early draft (for discussion only) of what we think it means and
> what is currently implemented in Linux, which I've put up (again, let
> me emphasize) for *discussion* here:
> 
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> 
> Comments are welcome, either on the wiki's talk page, or directly to
> me, or to the linux-fsdevel or linux-ext4.

I haven't read it yet.  One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t.:

  - data-finding metadata, preallocation, file extending, hole filling
  - unaligned access and what alignment is required
  - block devices vs. files, different filesystems, and
    behaviour-modifying mount options
  - a file open for buffered I/O on another descriptor, a file with
    mapped pages or mlocked pages
  - and of course whether the drive cache is write-through or not.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:56           ` Jamie Lokier
@ 2009-08-22  2:06             ` Theodore Tso
  0 siblings, 0 replies; 50+ messages in thread
From: Theodore Tso @ 2009-08-22  2:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:56:13AM +0100, Jamie Lokier wrote:
> AIX behaves like XFS according to documentation:
> 
>     [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]
> 
>     Direct I/O and Data I/O Integrity Completion
> 
>     Although direct I/O writes are done synchronously, they do not
>     provide synchronized I/O data integrity completion, as defined by
>     POSIX. Applications that need this feature should use O_DSYNC in
>     addition to O_DIRECT. O_DSYNC guarantees that all of the data and
>     enough of the metadata (for example, indirect blocks) have written
>     to the stable store to be able to retrieve the data after a system
>     crash. O_DIRECT only writes the data; it does not write the
>     metadata.
> 
> That's another reason to use O_DIRECT|O_DSYNC in moderately portable
> code.

...or use fsync() when they need to guarantee that data has been
atomically written, but not before.  This becomes critically important
if the application is writing into a sparse file, or writing into
uninitialized blocks that were allocated using fallocate(); otherwise,
with O_DIRECT|O_DSYNC, the file system would have to do a commit
operation after each write, which could be a performance disaster.
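
A sketch of that pattern (file name and block size are arbitrary;
fallocate() is the Linux-specific preallocation call):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* preallocate once, stream O_DIRECT writes into the uninitialized
     * extent, and pay for the metadata update with a single fsync() at
     * the end instead of an implicit commit per write */
    int fill_preallocated(const char *path, size_t nblocks)
    {
        void *buf;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, 4096)) {
            close(fd);
            return -1;
        }
        memset(buf, 0, 4096);
        if (fallocate(fd, 0, 0, (off_t)nblocks * 4096))
            goto fail;

        for (size_t i = 0; i < nblocks; i++)
            if (pwrite(fd, buf, 4096, (off_t)i * 4096) != 4096)
                goto fail;

        /* one commit covers all the extent/size updates */
        if (fsync(fd))
            goto fail;
        free(buf);
        return close(fd);
    fail:
        free(buf);
        close(fd);
        return -1;
    }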

> > http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> > 
> > Comments are welcome, either on the wiki's talk page, or directly to
> > me, or to the linux-fsdevel or linux-ext4.
> 
> I haven't read it yet.  One thing which comes to mind is it would be
> good to summarise what other OSes as well as Linux do with O_DIRECT
> w.r.t. data-finding metadata, preallocation, file extending, hole
> filling, unaligned access and what alignment is required, block
> devices vs. files and different filesystems and behaviour-modifying
> mount options, file open for buffered I/O on another descriptor, file
> has mapped pages, mlocked pages, and of course drive cache write
> through or not.

It's a wiki; contributions to define all of that are welcome.  :-)

We may want to carefully consider what we want to guarantee for all
time to application writers, and what we might want to leave open to
allow for performance optimizations by the kernel, though.

      	  	      		       	   	   - Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:45           ` Joel Becker
@ 2009-08-22  2:11             ` Theodore Tso
  2009-08-24  2:42               ` Christoph Hellwig
  2009-08-24  2:37             ` Christoph Hellwig
  1 sibling, 1 reply; 50+ messages in thread
From: Theodore Tso @ 2009-08-22  2:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jamie Lokier, Jens Axboe, linux-fsdevel,
	linux-scsi

On Fri, Aug 21, 2009 at 03:45:18PM -0700, Joel Becker wrote:
> On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> > http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> > 
> > Comments are welcome, either on the wiki's talk page, or directly to
> > me, or to the linux-fsdevel or linux-ext4.
> 
> 	In the section on perhaps not waiting for buffered fallback, we
> need to clarify that O_DIRECT reads need to know to look in the
> pagecache.  That is, if we decide that extending O_DIRECT writes without
> fsync can return before the data hits the storage, the caller shouldn't
> also have to call fsync() just to call read() of data they just wrote!

Yeah, I guess we can only do that if the filesystem guarantees
coherence between the page cache and O_DIRECT reads; it's been a long
while since I've studied that code, so I'm not sure whether all
filesystems that support O_DIRECT provide this coherency (since I
thought it was provided in the generic O_DIRECT routines, isn't it?)
or not.

							- Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:50             ` Jamie Lokier
@ 2009-08-22  2:19               ` Theodore Tso
  2009-08-22  2:31                 ` Theodore Tso
  2009-08-24  2:34               ` Christoph Hellwig
  1 sibling, 1 reply; 50+ messages in thread
From: Theodore Tso @ 2009-08-22  2:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:50:06AM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > > O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> > > integrity problems when direct writes are converted to buffered writes
> > > - which applies to all or nearly all OSes according to their
> > > documentation (I've read a lot of them).
> > 
> > It did not happen on IRIX where O_DIRECT originated that did not happen,
> 
> IRIX has an unusually sane O_DIRECT - at least according to it's
> documentation.  This is write(2):
> 
>      When attempting to write to a file with O_DIRECT or FDIRECT set,
>      the portion being written can not be locked in memory by any
>      process. In this case, -1 will be returned and errno will be set
>      to EBUSY.

Can you forward a pointer to an Irix man page which describes its
O_DIRECT semantics (or at least what they claim in their man pages)?
I was looking for one on the web, but I couldn't seem to find any
on-line web pages for Irix.  

It'd be nice if we could also get permission from SGI to quote
relevant sections in the "Clarifying Direct I/O Semantics" wiki page,
in case we end up quoting more than what someone might consider fair
game for fair use, but for now, I'd be really happy getting something
that I could keep for reference purposes.  Was there anything more
than what you quoted in the Irix write(2) man page about O_DIRECT?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  2:19               ` Theodore Tso
@ 2009-08-22  2:31                 ` Theodore Tso
  0 siblings, 0 replies; 50+ messages in thread
From: Theodore Tso @ 2009-08-22  2:31 UTC (permalink / raw)
  To: Jamie Lokier, Christoph Hellwig, Jens Axboe, linux-fsdevel,
	linux-scsi

On Fri, Aug 21, 2009 at 10:19:56PM -0400, Theodore Tso wrote:
> Can you forward a pointer to an Irix man page which describes its
> O_DIRECT semantics (or at least what they claim in their man pages)?
> I was looking for one on the web, but I couldn't seem to find any
> on-line web pages for Irix.  

Never mind, I found it.  (And I've added the relevant bits to the wiki
article).

					- Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:50             ` Jamie Lokier
  2009-08-22  2:19               ` Theodore Tso
@ 2009-08-24  2:34               ` Christoph Hellwig
  2009-08-27 14:34                 ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:34 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:50:06AM +0100, Jamie Lokier wrote:
> Oh, I agree with that.  That comes from observing that quasi-portable
> code using O_DIRECT needs to use O_DSYNC too because several OSes and
> filesystems on those OSes revert to buffered writes under some
> circumstances, in which case you want O_DSYNC too.  That has nothing
> to do with hardware caches, but it's a lucky coincidence that
> fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
> would then make sense as an FUA equivalent.

I agree.  I do however worry about everything using O_DIRECT that is
around now.  Less so about the databases and HPC workloads on expensive
hardware, because they usually run on vendor-approved scsi disks that
have the write-back cache disabled, but rather about things like
virtualization software or other things that get run on commodity
hardware.

Then again they already don't get what they expect and never did,
so if we clearly document and communicate the O_SYNC (that is Linux
O_SYNC) requirement we might be able to go with this.
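
For illustration, that documented usage would look something like the
following from an application (a sketch only; on Linux today O_DSYNC
and O_SYNC share the same value):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>

    /* Ask for data integrity explicitly instead of relying on
     * O_DIRECT alone. */
    int open_direct_with_integrity(const char *path)
    {
            return open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    }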

> Perhaps in the same way that fsync/fdatasync aren't clear on disk
> cache behaviour either.  On Linux and some other OSes.

The disk write cache really is an implementation detail, it has no
business in Posix.

Posix seems to define the semantics for fdatasync and co relatively
well (that is if you like the specification speak in there):

"The fdatasync() function forces all currently queued I/O operations
 associated with the file indicated by file descriptor fildes to the
 synchronised I/O completion state."

"synchronised I/O data integrity completion

 o For read, when the operation has been completed or diagnosed if
   unsuccessful. The read is complete only when an image of the data has
   been successfully transferred to the requesting process. If there were
   any pending write requests affecting the data to be read at the time
   that the synchronised read operation was requested, these write
   requests shall be successfully transferred prior to reading the
   data."
 o For write, when the operation has been completed or diagnosed if
   unsuccessful. The write is complete only when the data specified in the
   write request is successfully transferred and all file system
   information required to retrieve the data is successfully transferred."

Given that it talks about the data being retrievable, a volatile
cache does not seem to meet the above criteria.  But yeah, it's
horrible language.

> What does IRIX do?  Does O_DIRECT on IRIX write through the drive's
> cache?  What about Solaris?

IRIX only came pre-packaged with SGI MIPS systems, which, like most
of the more expensive hardware, were not configured with volatile
write back caches.  That btw is still the case for all the more
expensive hardware I have.  The whole issue with a volatile write back
cache is just too much of a data integrity nightmare to enable it
where your customers actually care about their data.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:45           ` Joel Becker
  2009-08-22  2:11             ` Theodore Tso
@ 2009-08-24  2:37             ` Christoph Hellwig
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:37 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 03:45:18PM -0700, Joel Becker wrote:
> 	In the section on perhaps not waiting for buffered fallback, we
> need to clarify that O_DIRECT reads need to know to look in the
> pagecache.  That is, if we decide that extending O_DIRECT writes without
> fsync can return before the data hits the storage, the caller shouldn't
> also have to call fsync() just to call read() of data they just wrote!

The way the O_DIRECT fallback is implemented currently is that data
does hit the disk before return, thanks to a:
	err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
					SYNC_FILE_RANGE_WAIT_BEFORE|
					SYNC_FILE_RANGE_WRITE|
					SYNC_FILE_RANGE_WAIT_AFTER);

which I expected to also sync the required metadata to disk, which
it doesn't.  Which btw are really horrible semantics given that
we export that beast to userspace as a separate system call.
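
To make the distinction concrete, here is a small userspace sketch of
the equivalent call (sync_file_range() is the syscall that maps to
do_sync_mapping_range): it writes back and waits on the data pages
only, so an explicit f[data]sync is still needed for the metadata:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Write back and wait on the given range; this does NOT commit
     * the metadata needed to find the data again after a crash. */
    static int writeback_range(int fd, off_t pos, off_t len)
    {
            return sync_file_range(fd, pos, len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }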


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  2:11             ` Theodore Tso
@ 2009-08-24  2:42               ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:42 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 10:11:37PM -0400, Theodore Tso wrote:
> Yeah, I guess we can only do that if the filesystem guarantees
> coherence between the page cache and O_DIRECT reads; it's been a long
> while since I've studied that code, so I'm not sure whether all
> filesystems that support O_DIRECT provide this coherency (since I
> thought it was provided in the generic O_DIRECT routines, isn't it?)
> or not.

It's provided in the generic code, yes (or at least appears to be).

Note that xfstests has quite a few tests exercising it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
                             ` (2 preceding siblings ...)
  2009-08-22  0:56           ` Jamie Lokier
@ 2009-08-26  6:34           ` Dave Chinner
  2009-08-26 15:01             ` Jamie Lokier
  3 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2009-08-26  6:34 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> > 
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.
> 
> This question recently came up on the ext4 developer's list, because
> of a question of how direct I/O to an preallocated (uninitialized)
> extent should be handled.  Are we supposed to guarantee synchronous
> updates of the metadata by the time write(2) returns, or not?  One of
> the ext4 developers (I can't remember if it was Mingming or Eric)
> asked an XFS developer what they did in that case, and I believe the
> answer they were given was that XFS started a commit, but did *not*
> wait for the commit to complete before returning from the Direct I/O
> write.  In fact, they were told (I believe this was from an SGI
> engineer, but I don't remember the name; we can track that down if
> it's important) that if an application wanted to guarantee metadata
> would be updated for an extending write, they had to use fsync() or
> O_SYNC/O_DSYNC.  

That would have been Eric asking me. My answer was that O_DIRECT does
not imply any new data integrity guarantees associated with a
write(2) call - it just avoids system caches. You get the same
guarantees of resiliency as a non-O_DIRECT write(2) call at
completion - it may or may not be there if you crash. If you want
some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
call f[data]sync(2) just like all other IO.

Also, note that direct IO is not necessarily synchronous - you can
do asynchronous direct IO.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26  6:34           ` Dave Chinner
@ 2009-08-26 15:01             ` Jamie Lokier
  2009-08-26 18:47               ` Theodore Tso
  0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-26 15:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Tso, Christoph Hellwig, Jens Axboe, linux-fsdevel,
	linux-scsi

Dave Chinner wrote:
> On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> > On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > > It turns out that applications needing integrity must use fdatasync or
> > > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > > choose to use buffered writes at any time, with no signal to the
> > > > application.
> > > 
> > > The fallback was a relatively recent addition to the O_DIRECT semantics
> > > for broken filesystems that can't handle holes very well.  Fortunately
> > > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > > semantics for that already.
> > 
> > Um, actually, we don't.  If we did that, we would have to wait for a
> > journal commit to complete before allowing the write(2) to complete,
> > which would be especially painfully slow for ext3.
> > 
> > This question recently came up on the ext4 developer's list, because
> > of a question of how direct I/O to an preallocated (uninitialized)
> > extent should be handled.  Are we supposed to guarantee synchronous
> > updates of the metadata by the time write(2) returns, or not?  One of
> > the ext4 developers (I can't remember if it was Mingming or Eric)
> > asked an XFS developer what they did in that case, and I believe the
> > answer they were given was that XFS started a commit, but did *not*
> > wait for the commit to complete before returning from the Direct I/O
> > write.  In fact, they were told (I believe this was from an SGI
> > engineer, but I don't remember the name; we can track that down if
> > it's important) that if an application wanted to guarantee metadata
> > would be updated for an extending write, they had to use fsync() or
> > O_SYNC/O_DSYNC.  
> 
> That would have been Eric asking me. My answer that O_DIRECT does
> not imply any new data integrity guarantees associated with a
> write(2) call - it just avoids system caches. You get the same
> guarantees of resiliency as a non-O_DIRECT write(2) call at
> completion - it may or may notbe there if you crash. If you want
> some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
> call f[data]sync(2) just like all other IO.
> 
> Also, note that direct IO is not necessarily synchronous - you can
> do asynchronous direct IO.....

I agree with all of the above, except:

  1. If the automatic O_SYNC fallback mentioned by Christopher is
     currently implemented at all, even in a subset of filesystems,
     then I think it should be removed.

     An app which wants integrity should be calling fsync/fdatasync or
     using O_DSYNC/O_SYNC explicitly - with fsync/fdatasync giving
     more control over batching.

     If it doesn't do any of those things, it may be using O_DIRECT
     for performance, and not wish to be penalised by an expensive
     O_SYNC on every individual write.  Especially when O_SYNC is
     fixed to commit drive caches.

  2. I agree with everything Dave said about needing to use some other
     mechanism for an integrity commit; O_DIRECT is not enough.

     We can't realistically make O_DIRECT (by itself) do integrity
     commits anyway, because on some drives that involves committing
     the drive cache, and it would be a large performance regression.
     Given O_DIRECT is often used for its performance, that's not an
     option.

  3. Currently none of the options provides good integrity commit.

     All of them fail to commit drive caches under some circumstances;
     even fsync on ext3 with barriers enabled (because it doesn't
     commit a journal record if there were writes but no inode change
     with data=ordered).

     This should be changed (or at least made optionally available),
     and that's all the more reason to avoid commit operations except
     when requested.

  4. On drives which need it, fdatasync/fsync must trigger a drive
     cache flush even when there is no dirty page cache to write,
     because dirty pages may have been written in the background
     already, and because O_DIRECT writes dirty the drive cache but
     not the page cache.

     A per-drive flag would make sense to optimise this: It is set by
     any non-FUA writes sent to the drive while the drive's writeback
     cache is enabled, and cleared when any cache flush command is
     sent.  When the flag is clear, further cache flush commands don't
     need to be sent (a rough sketch of this follows below).
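
A minimal sketch of that flag, with made-up names (this is not real
block layer code, just the logic pinned down):

    #include <stdbool.h>

    struct drive_state {
            bool wb_cache_enabled;  /* volatile write-back cache on */
            bool flush_needed;      /* dirtied since the last flush */
    };

    /* Call on completion of every write sent to the drive. */
    static void note_write(struct drive_state *d, bool fua)
    {
            if (d->wb_cache_enabled && !fua)
                    d->flush_needed = true;
    }

    /* Call when fsync/fdatasync wants the drive cache committed. */
    static void maybe_flush(struct drive_state *d)
    {
            if (!d->flush_needed)
                    return;         /* drive cache already clean */
            /* issue the cache flush command to the drive here */
            d->flush_needed = false;
    }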

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26 15:01             ` Jamie Lokier
@ 2009-08-26 18:47               ` Theodore Tso
  2009-08-27 14:50                 ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Theodore Tso @ 2009-08-26 18:47 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, linux-fsdevel,
	linux-scsi

On Wed, Aug 26, 2009 at 04:01:02PM +0100, Jamie Lokier wrote:
>   1. If the automatic O_SYNC fallback mentioned by Christopher is
>      currently implemented at all, even in a subset of filesystems,
>      then I think it should be removed.

Could you clarify what you meant by "it" above?  I'm not sure I
understood what you were referring to.

Also, it sounds like you and Dave are mostly agreeing with what
I've written here; is that true?

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

I'm trying to get consensus that this is both (a) an accurate
description of the state of affairs in Linux, and (b) that it is what
we think things should be, before I start circulating it around
application developers (especially database developers), to make sure
they have the same understanding of O_DIRECT semantics as we have.

> 
>   4. On drives which need it, fdatasync/fsync must trigger a drive
>      cache flush even when there is no dirty page cache to write,
>      because dirty pages may have been written in the background
>      already, and because O_DIRECT writes dirty the drive cache but
>      not the page cache.
> 

I agree we *should* do this, but we're going to take a pretty serious
performance hit when we do.  Mac OS chickened out and added an
F_FULLFSYNC option:

http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fcntl.2.html

The concern is that there are GUI programmers who want to update state
files after every window resize or move, and after every click in a web
browser.  These GUI programmers then get cranky when changes get lost
after proprietary video drivers cause the laptop to lock up.  If we
make fsync() too burdensome, then fewer and fewer applications will
use it.  Evidently the MacOS developers decided that the few
applications which really cared about doing device cache flushes were
vastly outnumbered by the applications that need a lightweight file
flush.  Should we do the same?
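
For reference, the Mac OS X facility is used roughly like this
(F_FULLFSYNC is the Darwin fcntl; it does not exist on Linux, so this
sketch just falls back to a plain fsync there):

    #include <fcntl.h>
    #include <unistd.h>

    static int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
            /* fsync plus a flush of the drive's write cache */
            if (fcntl(fd, F_FULLFSYNC) == 0)
                    return 0;
    #endif
            return fsync(fd);
    }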

It seems like an awful cop-out, but having seen, up close and
personal, how "aggressively stupid" some desktop programmers can be[1],
I can **certainly** understand why Apple chose the F_FULLFSYNC route.

[1] http://josefsipek.net/blahg/?p=364

    		  	     	      - Ted
				      (who really needs to get himself
				       an O_PONIES t-shirt :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-24  2:34               ` Christoph Hellwig
@ 2009-08-27 14:34                 ` Jamie Lokier
  2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-27 14:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> Then again they already don't get what they expect and never did,
> so if we clear document and communicate the O_SYNC (that is Linux
> O_SYNC) requirement we might be able to go with this.

I'm thinking, while we're looking at this, that now is a really good
time to split up O_SYNC and O_DSYNC.

We have separate fsync and fdatasync, so it should be quite tidy now.

Then we can document using O_DSYNC on Linux, which is fine for older
versions because it has the same value as O_SYNC at the moment.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26 18:47               ` Theodore Tso
@ 2009-08-27 14:50                 ` Jamie Lokier
  0 siblings, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-27 14:50 UTC (permalink / raw)
  To: Theodore Tso, Dave Chinner, Christoph Hellwig, Jens Axboe,
	linux-fsdevel, linux-scsi

Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 04:01:02PM +0100, Jamie Lokier wrote:
> >   1. If the automatic O_SYNC fallback mentioned by Christopher is
> >      currently implemented at all, even in a subset of filesystems,
> >      then I think it should be removed.
> 
> Could you clarify what you meant by "it" above?  I'm not sure I
> understood what you were referring to.

I meant the automatic O_SYNC fallback, in other words, if O_DIRECT
falls back to buffered writing, Chris said it automatically did
O_SYNC, and you followed up by saying it doesn't :-)

All I'm saying is if there's _some_ code doing O_SYNC writing when
O_DIRECT falls back to buffered, it should be ripped out.  Leave the
syncing to explicit fsync calls from userspace.

> >   4. On drives which need it, fdatasync/fsync must trigger a drive
> >      cache flush even when there is no dirty page cache to write,
> >      because dirty pages may have been written in the background
> >      already, and because O_DIRECT writes dirty the drive cache but
> >      not the page cache.
> > 
> 
> I agree we *should* do this, but we're going to take a pretty serious
> performance hit when we do.  Mac OS chickened out and added an
> F_FULLFSYNC option:

I know about that one.  (I've done quite a lot of research on O_DIRECT
and fsync behaviours).  It's really unfortunate that they didn't
provide F_FULLDATASYNC, which is what a database or VM would ideally
use.

I think Vxfs provides a whole suite of mount options to adjust what
O_SYNC and fdatasync actually do.

> The concern is that there are GUI programers that want to update state
> files after every window resize or move, and after click on a web
> browser.  These GUI programmers then get cranky when changes get lost
> after proprietary video drivers cause the laptop to lock up.  If we
> make fsync() too burdensome, then fewer and fewer applications will
> use it.  Evidently the MacOS developers decided the few applications
> who really cared about doing device cache flushes were much smaller
> than the fast number of applications that need a lightweight file
> flush.  Should we do the same?  

If fsync is cheap but doesn't commit changes properly - what's the
point in encouraging applications to use it?  Without drive cache
flushes, they will still lose changes occasionally.

(Btw, don't blame proprietary video drivers.  I see too many lockups
with open source video drivers too.)

> It seems like an awful cop-out, but having seen, up front and
> personal, how "agressively stupid" some desktop programmers can be[1],
> I can **certainly** understand why Apple chose the F_FULLSYNC route.

I did see a few of those threads, and I think your solution was genius.
Genius at keeping people quiet that is :-)

But it's also a good default.  fsync() isn't practical in shell
scripts or Makefiles, although that's really because "mv" lacks the
fsync option...

Personally I side with "want some kind of full-system asynchronous
transactionality please".  (Possibly aka. O_PONIES :-)

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 14:34                 ` Jamie Lokier
@ 2009-08-27 17:10                   ` Christoph Hellwig
  2009-08-27 17:24                     ` Ulrich Drepper
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-27 17:10 UTC (permalink / raw)
  To: Jamie Lokier, Ulrich Drepper; +Cc: linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 03:34:59PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > Then again they already don't get what they expect and never did,
> > so if we clear document and communicate the O_SYNC (that is Linux
> > O_SYNC) requirement we might be able to go with this.
> 
> I'm thinking, while we're looking at this, that now is a really good
> time to split up O_SYNC and O_DSYNC.
> 
> We have separate fsync and fdatasync, so it should be quite tidy now.
> 
> Then we can document using O_DSYNC on Linux, which is fine for older
> versions because it has the same value as O_SYNC at the moment.

Technically we could easily make O_SYNC really mean O_SYNC and implement
a separate O_DSYNC at the kernel level.

The question is how to handle this at the libc level.  Currently glibc
defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
for older ones.  I'm not sure how feasible this is, but maybe Ulrich has
some better ideas.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
@ 2009-08-27 17:24                     ` Ulrich Drepper
  2009-08-28 15:46                       ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Ulrich Drepper @ 2009-08-27 17:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/27/2009 10:10 AM, Christoph Hellwig wrote:
> The question is how to handle this at the libc level.  Currently glibc
> defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
> through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
> for olders.  I'm not sure how feasible this is, but maybe Ulrich has
> some better ideas.

The problem with O_* extensions is that the syscall doesn't fail if the 
flag is not handled.  This is a problem in the open implementation which 
can only be fixed with a new syscall.

Why can't we just go on and say we interpret O_SYNC like O_SYNC and
O_SYNC|O_DSYNC like O_DSYNC?  The POSIX spec explicitly requires that
the latter be handled like O_SYNC.

We could handle it by allocating two bits, only one of which is handled
in the kernel.  If the userlevel O_DSYNC definition were different from
the kernel definition then the kernel could interpret O_SYNC|O_DSYNC
like O_DSYNC.  The libc would then have to translate the userlevel
O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel
and the application, the userlevel flag would be passed to the kernel
and nothing bad happens.

The cleaner alternative is to have a sys_newopen which checks for 
unknown flags and fails in that case.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:24                     ` Ulrich Drepper
@ 2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
                                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 15:46 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> The problem with O_* extensions is that the syscall doesn't fail if the  
> flag is not handled.  This is a problem in the open implementation which  
> can only be fixed with a new syscall.
>
> Why cannot just go on and say we interpret O_SYNC like O_SYNC and  
> O_SYNC|O_DSYNC like O_DSYNC.  The POSIX spec explicitly requires that  
> the latter handled like O_SYNC.
>
> We could handle it by allocating two bits, only one is handled in the  
> kernel.  If the O_DSYNC definition for userlevel would be different from  
> the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> like O_DSYNC.  The libc would then have to translate the userlevel  
> O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> and the application, the userlevel flag would be passed to the kernel  
> and nothing bad happens.

What about the following variant:

 - given that our current O_SYNC really is and always has been actually
   Posix O_DSYNC, keep the numerical value and rename it to O_DSYNC in
   the headers.
 - Add a new O_SYNC definition:

	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)

   and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
   present.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
@ 2009-08-28 16:06                         ` Ulrich Drepper
  2009-08-28 16:17                           ` Christoph Hellwig
  2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 1 reply; 50+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>   - given that our current O_SYNC really is and always has been actuall
>     Posix O_DSYNC

If this is true, then this proposal would work, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:06                         ` Ulrich Drepper
@ 2009-08-28 16:17                           ` Christoph Hellwig
  2009-08-28 16:33                             ` Ulrich Drepper
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:17 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:06:35AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>>   - given that our current O_SYNC really is and always has been actuall
>>     Posix O_DSYNC
>
> If this is true, then this proposal would work, yes.

I'll put it on my todo list.  While reading through the Posix specs
I came up with some questions that you might be able to answer:

 - O_RSYNC basically means we need to commit atime updates before a
   read returns, right?  It would be easy to implement 
   it in a slightly suboptimal fashion, but is there any point?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:17                           ` Christoph Hellwig
@ 2009-08-28 16:33                             ` Ulrich Drepper
  2009-08-28 16:41                               ` Christoph Hellwig
  2009-08-28 16:46                               ` Jamie Lokier
  0 siblings, 2 replies; 50+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
> I'll put it on my todo list.

Any ABI change like this takes a long time to trickle down.

If this is agreed to be the correct approach then adding the O_* 
definitions earlier is better.  Even if it isn't yet implemented.  Then, 
once the kernel side is implemented, programs are ready to use it.  I 
cannot jump the gun and define the flags myself first.


>   - O_RSYNC basically means we need to commit atime updates before a
>     read returns, right?

No, that's not it.

O_RSYNC on its own just means the data is successfully transferred to 
the calling process (always the case).

O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
in a cache and not yet on the medium, then the write to medium is 
successful before the read succeeds.

O_RSYNC|O_SYNC means the same plus the integrity of file meta 
information (access time etc).

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                             ` Ulrich Drepper
@ 2009-08-28 16:41                               ` Christoph Hellwig
  2009-08-28 20:51                                 ` Ulrich Drepper
  2009-08-28 16:46                               ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:41 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:33:29AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
>> I'll put it on my todo list.
>
> Any ABI change like this takes a long time to trickle down.
>
> If this is agreed to be the correct approach then adding the O_*  
> definitions earlier is better.  Even if it isn't yet implemented.  Then,  
> once the kernel side is implemented, programs are ready to use it.  I  
> cannot jump the gun and define the flags myself first.

Yeah.  The implementation really is trivial in 2.6.32 - we basically
just need to change one function to check the new O_REALLY_SYNC flag
and pass down a 0 instead of a 1 to another routine in the generic
fs code, plus doing the same in a few filesystems opencoding it instead
of using the generic helpers.

So the logistics of doing the flags really is the biggest work here.
And I'm not entirely sure how to do it correctly.  Can we just switch
the current O_SYNC definition in the kernel headers to O_DSYNC while
adding the new O_SYNC and have everything continue to work?

>>   - O_RSYNC basically means we need to commit atime updates before a
>>     read returns, right?
>
> No, that's not it.
>
> O_RSYNC on its own just means the data is successfully transferred to  
> the calling process (always the case).
>
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently  
> in a cache and not yet on the medium, then the write to medium is  
> successful before the read succeeds.

That includes a write from another process?  So O_RSYNC basically means
doing a range-fdatasync before the actual read request?

Again, we could implement this easily if we care enough.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
@ 2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
> On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> > The problem with O_* extensions is that the syscall doesn't fail if the  
> > flag is not handled.  This is a problem in the open implementation which  
> > can only be fixed with a new syscall.
> >
> > Why cannot just go on and say we interpret O_SYNC like O_SYNC and  
> > O_SYNC|O_DSYNC like O_DSYNC.  The POSIX spec explicitly requires that  
> > the latter handled like O_SYNC.
> >
> > We could handle it by allocating two bits, only one is handled in the  
> > kernel.  If the O_DSYNC definition for userlevel would be different from  
> > the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> > like O_DSYNC.  The libc would then have to translate the userlevel  
> > O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> > and the application, the userlevel flag would be passed to the kernel  
> > and nothing bad happens.
> 
> What about hte following variant:
> 
>  - given that our current O_SYNC really is and always has been actuall
>    Posix O_DSYNC keep the numerical value and rename it to O_DSYNC in
>    the headers.
>  - Add a new O_SYNC definition:
> 
> 	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)
> 
>    and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
>    present.

That looks good for the kernel.

However, for userspace, there's an issue with applications which were
compiled with an old libc and used O_SYNC.  Most of them probably
expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
didn't do it right.

But they *didn't know* that.

When using a newer kernel which actually implements O_SYNC behaviour,
I'm thinking those applications which asked for O_SYNC should get it,
even though they're still linked with an old libc.

That's because this thread is the first time I've heard that Linux
O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
many Googlings I've done about O_SYNC in applications and on different
OS, it'll be news to other people too.

(I always thought the "#define O_DSYNC O_SYNC" was because Linux
didn't implement the weaker O_DSYNC).

(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
headers?  That doesn't make sense: O_RSYNC has nothing to do with
writing.)

To achieve that, libc could implement two versions of open() at the
same time as it updates header files.  The new libc's __old_open() would
do:

    /* Apps built against an old libc asked for O_SYNC but only got
       the O_DSYNC bit; give them real O_SYNC semantics.  */
    if (flags & O_DSYNC)
        flags |= O_SYNC;

I'm not exactly sure how symbol versioning works, but perhaps the
header file in the new libc would need __REDIRECT_NTH to map open() to
__new_open(), which just calls the kernel.  This is to ensure .o and
.a files built with an old libc's headers but then linked to a new
libc will get __old_open().

Although libc's __new_open() could have this:

    /* Old kernels only look at O_DSYNC.  It's better than nothing. */
    if (flags & O_SYNC)
        flags |= O_DSYNC;

Imho, it's better to not do that, and instead have

    #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

as Chris suggests, in the libc header the same as the kernel header,
because that way applications which use the syscall() function or have
to invoke a syscall directly (I've seen clone-using code doing it),
won't spontaneously start losing their O_SYNCness on older kernels.
Unless there is some reason why "flags &= ~O_SYNC" is not permitted to
clear the O_DSYNC flag, or other reason why they must be separate flags.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                             ` Ulrich Drepper
  2009-08-28 16:41                               ` Christoph Hellwig
@ 2009-08-28 16:46                               ` Jamie Lokier
  2009-08-29  0:59                                 ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:46 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> >  - O_RSYNC basically means we need to commit atime updates before a
> >    read returns, right?
> 
> No, that's not it.
> 
> O_RSYNC on its own just means the data is successfully transferred to 
> the calling process (always the case).
> 
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> in a cache and not yet on the medium, then the write to medium is 
> successful before the read succeeds.
> 
> O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> information (access time etc).

On several unixes, O_RSYNC means it will send the read to the
hardware, not relying on the cache.  This can be used to verify the
data which was written earlier, whether by O_DSYNC or fdatasync.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                           ` Ulrich Drepper
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> That's because this thread is the first time I've heard that Linux
> O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
> many Googlings I've done about O_SYNC in applications and on different
> OS, it'll be news to other people too.
> 
> (I always thought the "#define O_DSYNC O_SYNC" was because Linux
> didn't implement the weaker O_DSYNC).

It looks like we're not the only ones.  AIX has:

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm

    Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to
    O_SYNC. For binary compatibility reasons, this behavior still
    exists. If true O_SYNC behavior is required, then both O_DSYNC and
    O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON
    environment variable also enables true O_SYNC behavior.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:41                               ` Christoph Hellwig
@ 2009-08-28 20:51                                 ` Ulrich Drepper
  2009-08-28 21:08                                   ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Ulrich Drepper @ 2009-08-28 20:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:41 AM, Christoph Hellwig wrote:
> Yeah.  The implementation really is trivial in 2.6.32 - we basically
> just need to change one function to check the new O_REALLY_SYNC flag
> and pass down a 0 instead of a 1 to another routine in the generic
> fs code, plus doing the same in a few filesystems opencoding it instead
> of using the generic helpers.

I don't think you have to change anything.  As I wrote before, the 
kernel ignores unknown O_* flags.  It's usually a problem.  Here it is a 
positive thing.


> So the logistics of doing the flags really is the biggest work here.
> And I'm not entirely sure how to do it correctly.  Can we just switch
> the current O_SYNC defintion in the kernel headers to O_DSYNC while
> adding the new O_SYNC and everything will continue to work?

No, that's not a good idea.  This would mean a program compiled with
newer headers uses the new O_SYNC, which isn't known to old kernels and
is silently ignored.  Such programs will then not even get the current
O_DSYNC benefits.


> That includes a write from another process?  So O_RSYNC basically means
> doing an range-fdatasync before the actual read request?

Yes.  You can easily see how this can be useful.


> Again, we could implement this easily if we care enough.

I think it can be useful at times.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 20:51                                 ` Ulrich Drepper
@ 2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-30 16:44                                     ` Jamie Lokier
  0 siblings, 2 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, Aug 28, 2009 at 01:51:03PM -0700, Ulrich Drepper wrote:
> No, that's not a good idea.  This would mean a program compiled with  
> newer headers is using O_SYNC which isn't known to old kernels and  
> ignored.  Such programs will then not even get the current O_DSYNC 
> benefits.

Ok, let's agree on how to proceed:


once 2.6.31 is out we will do the following

 - do a global s/O_SYNC/O_DSYNC/g over the whole kernel tree
 - add this to include/asm-generic/fcntl.h and in modified form
   to arch headers not using it:

#ifndef O_FULLSYNC
#define O_FULLSYNC	02000000
#endif

#ifndef O_RSYNC
#define O_RSYNC		04000000
#endif

#define O_SYNC	(O_FULLSYNC|O_DSYNC)

 - during the normal merge window I will add a real implementation
   for O_FULLSYNC and O_RSYNC

P.S. better naming suggestions for O_FULLSYNC welcome

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 16:50                           ` Jamie Lokier
@ 2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-30 16:58                             ` Jamie Lokier
  2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 2 replies; 50+ messages in thread
From: Ulrich Drepper @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> However, for userspace, there's an issue with applications which were
> compiled with an old libc and used O_SYNC.  Most of them probably
> expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
> didn't do it right.

Right.  But these programs apparently can live with the broken 
semantics.  I don't worry too much about this.  If people really need 
the fixed O_SYNC semantics then let them recompile their code.


> When using a newer kernel which actually implements O_SYNC behaviour,
> I'm thinking those applications which asked for O_SYNC should get it,
> even though they're still linked with an old libc.

In general yes, but it's too expensive.  Again, existing programs expect 
the current behavior and can live with it.


> (Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> headers?  That doesn't make sense: O_RSYNC has nothing to do with
> writing.)

O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
that's the next best thing.  Of course I didn't know the Linux O_SYNC is 
really O_DSYNC.  In that context the definition doesn't make sense.


> Although libc's __new_open() could have this:
>
>      /* Old kernels only look at O_DSYNC.  It's better than nothing. */
>      if (flags&  O_SYNC)
>          flags |= O_DSYNC;
>
> Imho, it's better to not do that, and instead have
>
>      #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

Why should it be better?  You're replacing something the compiler can do 
with zero cost with active code.


Again, these O_* constant changes are sufficient.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
@ 2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> 
>  - during the normal merge window I will add a real implementation for
>    for O_FULLSYNC and O_RSYNC
> 
> P.S. better naming suggestions for O_FULLSYNC welcome

Basically you are just ensuring that the metadata changes are being
synced together with the data changes, so how about O_ISYNC (inode
sync)?



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-28 21:43                                         ` Trond Myklebust
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > 
> >  - during the normal merge window I will add a real implementation for
> >    for O_FULLSYNC and O_RSYNC
> > 
> > P.S. better naming suggestions for O_FULLSYNC welcome
> 
> Basically you are just ensuring that the metadata changes are being
> synced together with the data changes, so how about O_ISYNC (inode
> sync)?

Yeah.  Thinking about this a bit more, we should define this flag
much more clearly.  In the obvious implementation it would not actually
do anything if it's set on its own.  We would only check it if O_DSYNC
is already set, to decide if we want to set the datasync argument to
->fsync to 0 or 1 for the generic filesystems (and similar things for
filesystems not using the generic helper).

If we deem that this is too unsafe we could make sure O_DSYNC always
gets set on this flag in ->open, but if we make sure O_SYNC is defined
like the one above in the kernel headers and glibc we should be fine.

Although in that case a name that doesn't suggest that it actually does
something useful would be better.
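
For concreteness, a sketch of that check, with a made-up __O_ISYNC
standing in for the new bit (the name and value are placeholders, not
a real flag); the return value is the datasync argument handed to
->fsync:

    #include <fcntl.h>

    #define __O_ISYNC       010000000       /* placeholder value only */

    /* -1: no sync requested, 1: fdatasync semantics (O_DSYNC only),
     *  0: full fsync semantics (O_DSYNC plus the new bit). */
    static int fsync_datasync_arg(int open_flags)
    {
            if (!(open_flags & O_DSYNC))
                    return -1;
            return (open_flags & __O_ISYNC) ? 0 : 1;
    }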

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:29                                       ` Christoph Hellwig
@ 2009-08-28 21:43                                         ` Trond Myklebust
  2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, 2009-08-28 at 17:29 -0400, Christoph Hellwig wrote:
> On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > > 
> > >  - during the normal merge window I will add a real implementation for
> > >    for O_FULLSYNC and O_RSYNC
> > > 
> > > P.S. better naming suggestions for O_FULLSYNC welcome
> > 
> > Basically you are just ensuring that the metadata changes are being
> > synced together with the data changes, so how about O_ISYNC (inode
> > sync)?
> 
> Yeah.  Thinking about this a bit more we should define this flag
> much more clearly.  In the obvious implementation it would not actually
> do anything if it's set on it's own.  We would only check it if O_DSYNC
> is already set to decided if we want to set the datasync argument to
> ->fsync to 0 or 1 for the generic filesystems (and similar things for
> filesystems not using the generic helper).
> 
> If we deem that this is too unsafe we could make sure O_DSYNC always
> gets set on this fag in ->open, but if we make sure O_SYNC is defined
> like the one above in the kernel headers and glibc we should be fine.
> 
> Although in that case a name that doesn't suggest that it actually does
> something useful would be better.

If you are going to automatically set O_DSYNC in open(), then
fcntl(F_SETFL) might get a bit nasty.

Imagine using it after the open in order to clear the O_ISYNC flag;
you'll still be left with the O_DSYNC (which you never set in the first
place). That would be confusing...

Cheers
  Trond

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:43                                         ` Trond Myklebust
@ 2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 22:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:43:05PM -0400, Trond Myklebust wrote:
> If you are going to automatically set O_DSYNC in open(), then
> fcntl(F_SETFL) might get a bit nasty.
> 
> Imagine using it after the open in order to clear the O_ISYNC flag;
> you'll still be left with the O_DSYNC (which you never set in the first
> place). That would be confusing...

Indeed, that's a killer argument for the first variant.  We just need
to make it extremely clear (manpage _and_ comments) that only O_SYNC is
an exposed user interface and that O_WHATEVER_SYNC is an implementation
detail.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 23:06                         ` Jamie Lokier
  2009-08-28 23:46                           ` Christoph Hellwig
  2 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2009-08-28 23:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
>  - given that our current O_SYNC really is and always has been actuall
>    Posix O_DSYNC

Are you sure about this?

From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :

    Error description

       LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

    Problem summary

       On AIX, the O_SYNC and O_DSYNC are different values and
       performance improvement are available because the inode does
       not need to be flushed for mtime changes only.
       On Linux the flags are the same, so performance is lost.
       when databases open files with O_DIRECT and O_SYNC.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 23:06                         ` Jamie Lokier
@ 2009-08-28 23:46                           ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-08-28 23:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Ulrich Drepper, linux-fsdevel, linux-kernel

On Sat, Aug 29, 2009 at 12:06:23AM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> >  - given that our current O_SYNC really is and always has been actuall
> >    Posix O_DSYNC
> 
> Are you sure about this?
> 
> >From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :
> 
>     Error description
> 
>        LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

That is for GPFS, an out of tree filesystem with binary components.
It could be that they took Linux O_SYNC for real O_SYNC.  Any filesystem
using the generic helpers in Linux has gotten the O_DSYNC semantics at
least as long as I have worked on Linux filesystems, which is getting
close to 10 years now.  I'll do some code archaeology before we move
ahead with this, to be sure.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:46                               ` Jamie Lokier
@ 2009-08-29  0:59                                 ` Jamie Lokier
  0 siblings, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-29  0:59 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> Ulrich Drepper wrote:
> > >  - O_RSYNC basically means we need to commit atime updates before a
> > >    read returns, right?
> > 
> > No, that's not it.
> > 
> > O_RSYNC on its own just means the data is successfully transferred to 
> > the calling process (always the case).
> > 
> > O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> > in a cache and not yet on the medium, then the write to medium is 
> > successful before the read succeeds.
> > 
> > O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> > information (access time etc).
> 
> On several unixes, O_RSYNC means it will send the read to the
> hardware, not relying on the cache.  This can be used to verify the
> data which was written earlier, whether by O_DSYNC or fdatasync.

I'm sure I read that in a couple of OS man pages, but I can't find it
again.  Maybe it was something more obscure than the mainstream
unices; maybe I imagined it.  Ho hum.  For now, forget I said anything.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel, torvalds

Christoph Hellwig wrote:
> P.S. better naming suggestions for O_FULLSYNC welcome

O_FULLSYNC might get confused with MacOS X's F_FULLFSYNC, which means
something else: fsync through hardware volatile write caches.

(Might we even want to provide O_FULLSYNC and O_FULLDATASYNC to mean
that, eventually?)

O_ISYNC is a bit misleading if we don't really offer "flush just the
inode state" by itself.

So it should at least start with underscores: __O_ISYNC.

How about __O_SYNC_NEW with

    #define O_SYNC     (O_DSYNC|__O_SYNC_NEW)

I think that tells people reading the headers a bit about what to
expect on older kernels too.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                           ` Ulrich Drepper
@ 2009-08-30 16:58                             ` Jamie Lokier
  2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:58 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> >headers?  That doesn't make sense: O_RSYNC has nothing to do with
> >writing.)
> 
> O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
> that's the next best thing.

That's an error - O_SYNC is not a superset of O_RSYNC.

O_SYNC (by itself) only affects writes.

O_RSYNC only affect reads.

In the absence of O_RSYNC support in the kernel, it's better to not
define O_RSYNC at all in userspace.  That tells applications they can
call fsync/fdatasync themselves before reading to get an equivalent
effect.
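
For example, an application can approximate O_RSYNC|O_DSYNC reads
today with something like the following (a whole-file fdatasync, so
coarser than a true range sync):

    #include <unistd.h>

    /* Commit any pending writes for the file before reading, so the
     * read cannot return data that is not yet on the medium. */
    static ssize_t read_with_rsync(int fd, void *buf, size_t count)
    {
            if (fdatasync(fd) != 0)
                    return -1;
            return read(fd, buf, count);
    }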

In fact O_RSYNC, when implemented correctly, can be used by
applications to get the effect of range-fsync/fdatasync when such
system calls aren't available (by reading a range), but not as
efficiently of course.  Defining O_RSYNC as O_SYNC fails to do that.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-30 16:58                             ` Jamie Lokier
@ 2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2009-08-30 17:48 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >Although libc's __new_open() could have this:
> >
> >     /* Old kernels only look at O_DSYNC.  It's better than nothing. */
> >     if (flags & O_SYNC)
> >         flags |= O_DSYNC;
> >
> >Imho, it's better to not do that, and instead have
> >
> >     #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)
> 
> Why should it be better?  You're replacing something the compiler can do 
> with zero cost with active code.

You misread; I said the zero cost thing is better.

The only reason you might use the active code is this:

    /* Upgrade O_DSYNC to O_SYNC. */

    flags = fcntl(fd, F_GETFL, 0);
    flags = (flags | O_SYNC) & ~O_DSYNC;
    fcntl(fd, F_SETFL, flags);

I'm not sure if that should work in POSIX.
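
(If it helps, here is a toy program with invented flag values showing
why the bit arithmetic above gets murky under the
(O_DSYNC|__O_SYNC_NEW) encoding; the real header values would differ:)

    #include <stdio.h>

    #define TOY_O_DSYNC   0x1000
    #define TOY_O_SYNC    (TOY_O_DSYNC | 0x2000)  /* O_DSYNC|__O_SYNC_NEW */

    int main(void)
    {
        unsigned flags = TOY_O_DSYNC;                 /* opened with O_DSYNC */
        flags = (flags | TOY_O_SYNC) & ~TOY_O_DSYNC;  /* the "upgrade" above */
        /* Prints 0x2000: the O_DSYNC half of O_SYNC has been stripped,
           so the kernel would see neither plain O_DSYNC nor full O_SYNC. */
        printf("flags = %#x\n", flags);
        return 0;
    }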

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2009-08-30 17:48 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
2009-08-21 11:40   ` Jens Axboe
2009-08-21 13:54     ` Jamie Lokier
2009-08-21 14:26       ` Christoph Hellwig
2009-08-21 15:24         ` Jamie Lokier
2009-08-21 17:45           ` Christoph Hellwig
2009-08-21 19:18             ` Ric Wheeler
2009-08-22  0:50             ` Jamie Lokier
2009-08-22  2:19               ` Theodore Tso
2009-08-22  2:31                 ` Theodore Tso
2009-08-24  2:34               ` Christoph Hellwig
2009-08-27 14:34                 ` Jamie Lokier
2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
2009-08-27 17:24                     ` Ulrich Drepper
2009-08-28 15:46                       ` Christoph Hellwig
2009-08-28 16:06                         ` Ulrich Drepper
2009-08-28 16:17                           ` Christoph Hellwig
2009-08-28 16:33                             ` Ulrich Drepper
2009-08-28 16:41                               ` Christoph Hellwig
2009-08-28 20:51                                 ` Ulrich Drepper
2009-08-28 21:08                                   ` Christoph Hellwig
2009-08-28 21:16                                     ` Trond Myklebust
2009-08-28 21:29                                       ` Christoph Hellwig
2009-08-28 21:43                                         ` Trond Myklebust
2009-08-28 22:39                                           ` Christoph Hellwig
2009-08-30 16:44                                     ` Jamie Lokier
2009-08-28 16:46                               ` Jamie Lokier
2009-08-29  0:59                                 ` Jamie Lokier
2009-08-28 16:44                         ` Jamie Lokier
2009-08-28 16:50                           ` Jamie Lokier
2009-08-28 21:08                           ` Ulrich Drepper
2009-08-30 16:58                             ` Jamie Lokier
2009-08-30 17:48                             ` Jamie Lokier
2009-08-28 23:06                         ` Jamie Lokier
2009-08-28 23:46                           ` Christoph Hellwig
2009-08-21 22:08         ` Theodore Tso
2009-08-21 22:38           ` Joel Becker
2009-08-21 22:45           ` Joel Becker
2009-08-22  2:11             ` Theodore Tso
2009-08-24  2:42               ` Christoph Hellwig
2009-08-24  2:37             ` Christoph Hellwig
2009-08-22  0:56           ` Jamie Lokier
2009-08-22  2:06             ` Theodore Tso
2009-08-26  6:34           ` Dave Chinner
2009-08-26 15:01             ` Jamie Lokier
2009-08-26 18:47               ` Theodore Tso
2009-08-27 14:50                 ` Jamie Lokier
2009-08-21 14:20     ` Christoph Hellwig
2009-08-21 15:06       ` James Bottomley
2009-08-21 15:23         ` Christoph Hellwig
