From: Denis Vlasenko <vda.linux@googlemail.com>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: Linus Torvalds <torvalds@osdl.org>, Viktor <vvp01@inbox.ru>,
Aubrey <aubreylee@gmail.com>, Hua Zhong <hzhong@gmail.com>,
Hugh Dickins <hugh@veritas.com>,
linux-kernel@vger.kernel.org, hch@infradead.org,
kenneth.w.chen@intel.com, akpm@osdl.org
Subject: Re: O_DIRECT question
Date: Sun, 21 Jan 2007 00:05:36 +0100
Message-ID: <200701210005.36274.vda.linux@googlemail.com>
In-Reply-To: <45B281BB.50607@tls.msk.ru>

On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace. But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it. Omitting reads for now.

Really? All 100 megs _at once_? Linus described a fairly simple
(conceptually) idea here: http://lkml.org/lkml/2002/5/11/58

In short, a page-aligned read buffer can simply be unmapped,
with the page fault handler catching accesses to not-yet-read data.
As data comes in from disk, it gets mapped back into the process's
address space.

This way read() returns almost immediately and the CPU is free to do
something useful.

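As a rough user-space analogy of the idea above (not the kernel mechanism
Linus proposed), the sketch below hands back a buffer whose pages are all
inaccessible and fills each page the first time it is touched. The names
lazy_read and lazy_fault, and the in-memory src array standing in for the
disk, are assumptions made for illustration. A real implementation would
live in the kernel's page fault path and would block the faulting access
until the I/O for that page completes. Linux-only, since it relies on
SIGSEGV being raised for PROT_NONE pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names here are hypothetical; this is an analogy, not kernel code. */
static unsigned char *lazy_buf;        /* buffer returned to the caller */
static const unsigned char *lazy_src;  /* stands in for data on disk */
static size_t lazy_len;                /* length of the buffer in bytes */
static size_t lazy_page;               /* system page size */
static volatile sig_atomic_t lazy_faults; /* pages filled so far */

/* Fault handler: make the touched page accessible and fill it. */
static void lazy_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)lazy_buf;
    size_t off, n;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + lazy_len)
        _exit(1); /* a genuine bad access, not one of ours */

    off = (size_t)(addr - base) & ~(lazy_page - 1);
    /* mprotect() is not formally async-signal-safe; this relies on
     * it working from a handler on Linux in practice. */
    if (mprotect(lazy_buf + off, lazy_page, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    n = lazy_len - off < lazy_page ? lazy_len - off : lazy_page;
    memcpy(lazy_buf + off, lazy_src + off, n); /* "data arrives" */
    lazy_faults++;
}

/* Returns immediately with a buffer that is filled page by page on
 * first access. Returns NULL on failure. */
unsigned char *lazy_read(const unsigned char *src, size_t len)
{
    struct sigaction sa;

    assert(src != NULL && len > 0);
    lazy_page = (size_t)sysconf(_SC_PAGESIZE);
    lazy_src = src;
    lazy_len = len;
    lazy_faults = 0;

    lazy_buf = mmap(NULL, len, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (lazy_buf == MAP_FAILED)
        return NULL;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = lazy_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;
    return lazy_buf;
}
```

The caller pays one fault per page it actually touches, and pages it
never looks at are never filled. That is the property the proposal is
after, though here the "I/O" is just a memcpy() done synchronously in
the handler.
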
> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU-intensive thing to do (e.g. a sort),
why not benefit from the lack of an extra context switch?
Assume that we have "clever writes" like Linus described.

	/* something like "caching i/o over this fd is mostly useless" */
	/* (this API looks easier to transition to than fadvise etc.:
	 * it "looks like" O_DIRECT) */
	fd = open(..., flags|O_STREAM);
	...
	/* Starts writeout immediately due to O_STREAM,
	 * marks buf100meg's pages R/O to catch modifications,
	 * but doesn't block! */
	write(fd, buf100meg, 100*1024*1024);
	/* We are free to do something useful in parallel */
	sort();

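O_STREAM above does not exist. The closest approximation I know of with
existing Linux calls is sketched below, with the loud caveat that it is
not equivalent: the write still copies into the page cache instead of
marking the user pages R/O, and sync_file_range() may itself block if
the device queue is full. The helper names stream_write_begin and
stream_write_end are made up for this example; pwrite(),
sync_file_range(), fdatasync() and posix_fadvise() are the real calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy buf into the page cache at offset off and
 * ask the kernel to start writeback now, without waiting for it.
 * Returns 0 on success, -1 if the write itself failed. */
int stream_write_begin(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, off + (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;       /* interrupted, retry */
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    /* Advisory: kick off writeout of the dirty pages in this range.
     * A failure here only means writeback starts later, so it is
     * deliberately ignored. */
    (void)sync_file_range(fd, off, (off_t)len, SYNC_FILE_RANGE_WRITE);
    return 0;
}

/* Hypothetical helper: wait until the data is on stable storage, then
 * hint that the cached pages are no longer needed ("caching i/o over
 * this fd is mostly useless"). Returns 0 on success, -1 on error. */
int stream_write_end(int fd, off_t off, size_t len)
{
    if (fdatasync(fd) != 0)
        return -1;
    /* Advisory as well: dropping the cache is only a hint. */
    (void)posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED);
    return 0;
}
```

Usage would follow the fragment above: call stream_write_begin(), run
sort() while the writeout proceeds, then call stream_write_end() at the
point where the data must be durable.
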
> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that an O_DIRECT write() blocks until the I/O is really done.
A normal write() can block for much less time, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems. But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why did we bother to write Linux at all?
There were other Unixes which worked ok.
--
vda