From: Ric Wheeler <rwheeler@redhat.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jamie Lokier <jamie@shareable.org>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Fri, 21 Aug 2009 15:18:03 -0400
Message-ID: <4A8EF2EB.2040707@redhat.com>
In-Reply-To: <20090821174525.GA28861@infradead.org>

On 08/21/2009 01:45 PM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
>    
>> In measurements I've done, disabling a disk's write cache results in
>> much slower ext3 filesystem writes than using barriers.  Others report
>> similar results.  This is with disks that don't have NCQ; good NCQ may
>> be better.
>>      
> On a SCSI disk and a SATA SSD with NCQ I get different results.  Most
> workloads, in particular metadata-intensive ones and large streaming
> writes, are noticeably better with the write cache simply turned off.
> The only ones that benefit from it are relatively small writes without
> O_SYNC or many fsyncs.  This is however using XFS, which tends to issue
> many more barriers than ext3.
>    

With normal SATA disks, streaming write workloads on ext3 run twice as
fast with barriers & write cache enabled as with the write cache
disabled in my testing.

Small file workloads were more even, if I remember correctly...

ric

>    
>> Using FUA for all writes should be equivalent to writing with write
>> cache disabled.
>>
>> A journalling filesystem or database tends to write like this:
>>
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>      
> In the optimal case, yeah.
>
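From an application's point of view, the pattern quoted above looks
roughly like the sketch below (Python for brevity).  It is only an
illustration: the function names, the file name and the 512-byte record
size are made up, and fdatasync() only marks where the application asks
for durability.  Whether that turns into a CACHE FLUSH at the disk
depends on the kernel, the filesystem and the barrier settings.

```python
import os
import tempfile


def journalled_commit(fd, blocks, commit_record):
    """Write journal blocks, sync, write the commit record, sync again.

    Hypothetical helper: mirrors WRITE x N, CACHE FLUSH, WRITE, CACHE FLUSH.
    """
    offset = 0
    syncs = 0
    for block in blocks:                      # WRITE, WRITE, WRITE ...
        offset += os.pwrite(fd, block, offset)
    os.fdatasync(fd)                          # journal blocks must be stable first
    syncs += 1
    offset += os.pwrite(fd, commit_record, offset)  # WRITE (commit record)
    os.fdatasync(fd)                          # commit record must be stable
    syncs += 1
    return offset, syncs


def demo():
    """Run one commit against a scratch file and report bytes and syncs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.bin")   # made-up file name
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            return journalled_commit(fd, [b"A" * 512] * 5, b"C" * 512)
        finally:
            os.close(fd)
```

The point of the ordering is that only two sync points are needed for
six writes, which is where the cache flush variant can win over making
every single write stable on its own.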
>    
>> Assuming that WRITE FUA is equivalent to disabling write cache, we may
>> expect the WRITE FUA version to run much slower than the CACHE FLUSH
>> version.
>>      
> For a workload that only does FUA writes, yeah.  That is however the use
> case for virtual machines.  As I'm looking into those issues I will run
> some benchmarks comparing both variants.
>
>    
>> It's also too weak, of course, on drives which don't support FUA.
>> Then you have to use CACHE FLUSH anyway, so the code should support
>> that (or disable the write cache entirely, which also performs badly).
>> If you don't handle drives without FUA, then you're back to "integrity
>> sometimes, user must check type of hardware", which is something we're
>> trying to get away from.  Integrity should not be a surprise when the
>> application requests it.
>>      
> As mentioned in the previous mails, FUA would only be an optimization
> (if it ends up helping); we do need to support the cache flush case.
>
>    
>>> I thought about this a lot.  It would be sensible to only require
>>> the FUA semantics if O_SYNC is specified.  But from looking around at
>>> users of O_DIRECT no one seems to actually specify O_SYNC with it.
>>>        
>> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
>> inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.
>>      
> O_SYNC above is the Linux O_SYNC aka POSIX O_DSYNC.
>
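For reference, opening a file with O_DIRECT|O_DSYNC from user space
could look like the sketch below (in Python to keep it short).  The
4096-byte block size, the helper names and the fallback to plain
O_DSYNC when O_DIRECT is rejected with EINVAL are assumptions made for
the example, not something the kernel or any filesystem mandates.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment and transfer size; real devices may differ


def _write_block(path, flags, buf):
    """Open path with the given flags and write buf at offset 0."""
    fd = os.open(path, flags, 0o600)
    try:
        return os.pwrite(fd, buf, 0)
    finally:
        os.close(fd)


def write_direct_dsync(path, payload):
    """Write one block with O_DIRECT|O_DSYNC; fall back to O_DSYNC alone.

    Returns (bytes_written, used_o_direct).
    """
    buf = mmap.mmap(-1, BLOCK)        # anonymous mapping, page-aligned memory
    try:
        buf.write(payload[:BLOCK])    # remainder of the block stays zero-filled
        base = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
        direct = getattr(os, "O_DIRECT", 0)   # not defined on every platform
        if direct:
            try:
                return _write_block(path, base | direct, buf), True
            except OSError as exc:
                # EINVAL here usually means O_DIRECT is not supported
                # on this filesystem; anything else is a real error.
                if exc.errno != errno.EINVAL:
                    raise
        return _write_block(path, base, buf), False
    finally:
        buf.close()
```

The aligned buffer comes from an anonymous mmap because O_DIRECT
generally wants the user buffer, the file offset and the length aligned
to the device's block size.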
>    
>> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
>> integrity problems when direct writes are converted to buffered writes
>> - which applies to all or nearly all OSes according to their
>> documentation (I've read a lot of them).
>>      
> It did not happen on IRIX, where O_DIRECT originated, and
> neither does it happen on Linux when using XFS.  Then again at least on
> Linux we provide O_SYNC (that is Linux O_SYNC, aka POSIX O_DSYNC)
> semantics for that case.
>
>    
>> Imho, integrity should not be something which depends on the user
>> knowing the details of their hardware to decide application
>> configuration options - at least, not out of the box.
>>      
> That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
> is not what users naively expect.  And the wording in our manpages also
> suggests this behaviour, although it is not entirely clear:
>
>
> O_DIRECT (Since Linux 2.4.10)
>
> 	Try to minimize cache effects of the I/O to and from this file.  In
> 	general this will degrade performance, but it is useful in special
> 	situations, such as when applications do their own caching.  File I/O
> 	is done directly to/from user space buffers.  The I/O is synchronous,
> 	that is, at the completion of a read(2) or write(2), data is
> 	guaranteed to have been transferred.  See NOTES below for further
> 	discussion.
>
> (And yeah, the whole wording is horrible, I will send an update once
> we've sorted out the semantics, including caveats about older kernels)
>
>    
>>> And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
>>> if O_DIRECT bypasses the filesystem cache there is nothing else
>>> left to sync for a non-extending write.
>>>        
>> Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
>> Ugh, how messy.
>>      
> Yes.  Except when using XFS and using the "osyncisosync" mount option :)
>
>    
>>> The fallback was a relatively recent addition to the O_DIRECT semantics
>>> for broken filesystems that can't handle holes very well.  Fortunately
>>> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
>>> semantics for that already.
>>>        
>> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
>> O_DIRECT either? :-)
>>      
> No.  In the generic code and filesystems I looked at it simply has no
> effect at all.
>

