From: Phillip Susi <psusi@cfl.rr.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Denis Vlasenko <vda.linux@googlemail.com>,
	Bill Davidsen <davidsen@tmr.com>,
	Michael Tokarev <mjt@tls.msk.ru>,
	Linus Torvalds <torvalds@osdl.org>, Viktor <vvp01@inbox.ru>,
	Aubrey <aubreylee@gmail.com>, Hua Zhong <hzhong@gmail.com>,
	Hugh Dickins <hugh@veritas.com>,
	Linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: O_DIRECT question
Date: Tue, 30 Jan 2007 13:50:41 -0500
Message-ID: <45BF9381.8010108@cfl.rr.com>
In-Reply-To: <20070130164806.GQ8030@opteron.random>

Andrea Arcangeli wrote:
> On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote:
>> Did you intentionally drop this reply off list?
> 
> No.

Then I'll restore the lkml to the cc list.

>> No, it doesn't... or at least can't report WHERE the error is.
> 
> O_SYNC doesn't report where the error is either, try a write(fd, buf,
> 10*1024*1024).

It should return the number of bytes successfully written before the 
error, giving you the location of the first error.  Using smaller 
individual writes (preferably issued in parallel) also allows the 
problem spot to be isolated.
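
To illustrate the point about smaller writes, here is a minimal sketch in
Python (my choice for brevity; the same logic applies to write(2) from C).
The helper name write_in_chunks and the 4096-byte default are assumptions
for illustration only.  It issues the writes sequentially rather than in
parallel, and reports how many bytes were accepted before the first
failure, which is the offset of the problem spot to within one chunk.

```python
import os


def write_in_chunks(fd, data, chunk_size=4096):
    """Write data to fd in chunk_size pieces.

    Returns (bytes_written, error).  error is None on success; otherwise
    it is the OSError raised by the failing write, and bytes_written is
    the offset at which that write started.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            # os.write may accept fewer bytes than requested, so advance
            # by the count it actually returns.
            n = os.write(fd, view[written:written + chunk_size])
        except OSError as exc:
            return written, exc
        written += n
    return written, None
```

With a descriptor opened with os.O_SYNC, each chunk should be on stable
storage before the next one is issued, so a smaller chunk_size narrows
down the failing region at the cost of throughput.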

>> Typically you only want one sector of data to be written before you 
>> continue.  In the cases where you don't, this might be nice, but as I 
>> said above, you can't handle errors properly.
> 
> Sorry but you're dreaming if you're thinking anything in real life
> writes at 512bytes at time with O_SYNC. Try that with any modern
> harddisk.

When you are writing a transaction log, you do; you don't need much 
data, but you do need to be sure it has hit the disk before continuing. 
You certainly aren't writing many MB across a dozen write() calls and 
only then caring to make sure it is all flushed in an unknown order.  When 
order matters, you cannot use fsync, which is one of the reasons why 
databases use O_DIRECT; they care about the ordering.
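
The transaction-log pattern described above might look roughly like the
following Python sketch.  The names open_log and commit, the 512-byte
SECTOR constant, and the record layout are all hypothetical; this is not
taken from any particular database.  It relies on O_SYNC so that the log
write returns only once the kernel reports it durable, and only then
touches the data file.

```python
import os

SECTOR = 512  # assumed sector size; many current devices use 4096


def open_log(path):
    # O_SYNC asks the kernel to complete each write() to stable storage
    # before returning; O_APPEND keeps records in arrival order.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    return os.open(path, flags, 0o600)


def commit(log_fd, data_fd, record, offset, payload):
    """Log one sector-sized record, then apply payload to the data file."""
    if len(record) > SECTOR:
        raise ValueError("record does not fit in one sector")
    sector = record.ljust(SECTOR, b"\0")  # pad to a whole sector
    if os.write(log_fd, sector) != SECTOR:
        raise OSError("short write to log")
    # The data write is issued only after the log write has returned,
    # which is what fixes the ordering between the two.
    return os.pwrite(data_fd, payload, offset)
```

The ordering comes from the program waiting for the first write to return,
not from anything the second call does.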

>>> Just grep for fsync in the db code of your choice (try postgresql) and
>>> then explain me why they ever call fsync in their code, if you know
>>> how to do better with O_SYNC ;).
>> Doesn't sound like a very good idea to me.
> 
> Why not a good idea to check any real life app?

I meant it is not a good idea to use fsync as you can't properly handle 
errors.

>> The stalling is caused by cache pollution.  Since you did not specify a 
>> block size dd uses the base block size of the output disk.  When 
>> combined with sync, only one block is written at a time, and no more 
>> until the first block has been flushed.  Only then does dd send down 
>> another block to write.  Without dd the kernel is likely allowing many 
>> mb to be queued in the buffer cache.  Limiting output to one block at a 
>> time is not good for throughput, but allowing half of ram to be used by 
>> dirty pages is not good either.
> 
> Throughput is perfect. I forgot to tell I combine it with ibs=4k
> obs=16M. Like it would be perfect with odirect too for the same
> reason. Stalling the I/O pipeline once every 16M isn't measurable in

Throughput is nowhere near perfect, as the pipeline is stalled for quite 
some time.  The pipe fills up quickly while dd is blocked on the sync 
write, which then blocks tar until all 16 MB have hit the disk.  Only 
then does dd go back to reading from the tar pipe, allowing it to 
continue.  During the time it takes tar to archive another 16 MB of 
data, the write queue is empty.  The only time that the tar process gets 
to continue running while data is written to disk is in the small time 
it takes for the pipe (4 KB, isn't it?) to fill up.
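
A back-of-the-envelope model of the stall, as a Python sketch.  The
function names and the rates used in the example are hypothetical, not
measurements, and the model ignores the pipe buffer because it is tiny
next to a 16 MB batch.  If producing a batch and flushing it never
overlap, a batch of B MB takes B/p + B/d seconds, so the effective rate
is p*d / (p + d); if they overlap fully, the rate is min(p, d).

```python
def serialized_rate(produce_mb_s, disk_mb_s):
    """Throughput when the producer and the disk take strict turns."""
    # B / (B/p + B/d) simplifies to p*d / (p + d), independent of B.
    return produce_mb_s * disk_mb_s / (produce_mb_s + disk_mb_s)


def overlapped_rate(produce_mb_s, disk_mb_s):
    """Throughput when producing and flushing run concurrently."""
    return min(produce_mb_s, disk_mb_s)
```

For example, if tar could produce 50 MB/s and the disk could absorb
50 MB/s, the serialized pipeline would manage only 25 MB/s, half of what
the overlapped one would.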

>> The semantics of the two are very much the same; they only differ in the 
>> internal implementation.  As far as the caller is concerned, in both 
>> cases, he is sure that writes are safe on the disk when they return, and 
>> reads semantically are no different with either flag.  The internal 
>> implementations lead to different performance characteristics, and the 
>> other post was simply commenting that the performance characteristics of 
>> O_SYNC + madvise() is almost the same as O_DIRECT, or even better in 
>> some cases ( since the data read may already be in cache ).
> 
> The semantics mandates the implementation because the semantics make
> up for the performance expectations. For the same reason you shouldn't
> write 512bytes at time with O_SYNC you also shouldn't use O_SYNC if
> your device risks to create a bottleneck in the CPU and memory.

No, semantics have nothing to do with performance.  Semantics deal with 
the state of the machine after the call, not how quickly it got there. 
Semantics are a question of correct operation, not optimal operation.

With both O_DIRECT and O_SYNC, the machine state is essentially the same 
after the call: the data has hit the disk.  Aside from the performance 
difference, the application cannot tell the difference between O_DIRECT 
and O_SYNC, so if that performance difference can be resolved by 
changing the implementation, Linus can be happy and get rid of O_DIRECT.



