From: Denis Vlasenko <vda.linux@googlemail.com>
To: Bill Davidsen <davidsen@tmr.com>
Cc: Hugh Dickins <hugh@veritas.com>,
	Linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: open(O_DIRECT) on a tmpfs?
Date: Sat, 6 Jan 2007 01:30:13 +0100
Message-ID: <200701060130.13903.vda.linux@googlemail.com>
In-Reply-To: <459E7AB3.8080802@tmr.com>

On Friday 05 January 2007 17:20, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > But O_DIRECT is _not_ about cache. At least I think it was not about
> > cache initially, it was more about DMAing data directly from/to
> > application address space to/from disks, saving memcpy's and double
> > allocations. Why do you think it has that special alignment requirements?
> > Are they cache related? Not at all!

> I'm not sure I can see how you find "don't use cache" not cache related. 
> Saving the resources needed for cache would seem to obviously leave them 
> for other processes.

I feel that the word "direct" has nothing to do with caching (or the lack
thereof). "Direct" means that I want to avoid extra allocations and memcpy:

	write(fd, hugebuf, 100*1024*1024);

Here the application uses 100 megs for hugebuf, and if it is not sufficiently
aligned, even the smartest kernel in this universe cannot DMA this data
to disk. No way. So it needs to allocate ANOTHER, aligned buffer,
memcpy the data (completely flushing L1 and L2 dcaches), and DMA it
from there. Thus we use twice as much RAM as we really need, and do
a lot of mostly pointless memory moves! And worse, the application cannot
even detect it - it works, it's just slow and eats a lot of RAM and CPU.

That's where O_DIRECT helps. When an app wants to avoid that, it opens the fd
with O_DIRECT. The app in effect says: "I *do* want to avoid extra shuffling,
because I will write huge amounts of data in big blocks."
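
To make the alignment point concrete, here is a minimal sketch of an
O_DIRECT write from an aligned buffer. It is only an illustration under
stated assumptions: the names dio_alloc, dio_write_file and DIO_ALIGN are
made up for this example, and the 4096-byte alignment is a guess, since the
real requirement depends on the device and filesystem. Because some
filesystems refuse O_DIRECT, the sketch falls back to ordinary buffered I/O
on EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0             /* not exposed here: degrade to buffered I/O */
#endif

/* Assumed alignment; the real value depends on device and filesystem. */
#define DIO_ALIGN 4096

/* Allocate len zeroed bytes aligned to DIO_ALIGN (hypothetical helper). */
static void *dio_alloc(size_t len)
{
    void *p = NULL;

    assert(len % DIO_ALIGN == 0);      /* the length must be aligned as well */
    if (posix_memalign(&p, DIO_ALIGN, len) != 0)
        return NULL;
    assert(((uintptr_t)p % DIO_ALIGN) == 0);
    memset(p, 0, len);
    return p;
}

/* Write buf to path, preferring O_DIRECT (hypothetical helper).
 * Returns the number of bytes written, or -1 on error.
 * *used_direct reports whether O_DIRECT was still in effect at the end. */
static ssize_t dio_write_file(const char *path, const void *buf, size_t len,
                              int *used_direct)
{
    const char *p = buf;
    size_t done = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);

    *used_direct = (fd >= 0 && O_DIRECT != 0);
    if (fd < 0 && errno == EINVAL)     /* filesystem refuses O_DIRECT */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno == EINVAL && *used_direct) {
            /* Direct write rejected: clear O_DIRECT and retry buffered. */
            int fl = fcntl(fd, F_GETFL);
            if (fl >= 0 && fcntl(fd, F_SETFL, fl & ~O_DIRECT) == 0) {
                *used_direct = 0;
                continue;
            }
        }
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    if (close(fd) != 0)
        return -1;
    return (ssize_t)done;
}
```

A caller would get its buffer from dio_alloc(), fill it, and pass it to
dio_write_file(); checking used_direct afterwards shows whether the data
really bypassed the extra copy or silently took the buffered path.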

> > But _conceptually_ "direct DMAing" and "do-not-cache-me"
> > are orthogonal, right?
>
> In the sense that you must do DMA or use cache, yes.

Let's say I implemented a heuristic in my cp command:
if the source file is indeed a regular file and it is
larger than 128K, allocate an aligned 128K buffer
and try to copy it using O_DIRECT i/o.

Then I use this "enhanced" cp command to copy a large directory
recursively, and then I run grep on that directory.

Can you explain why cp shouldn't cache the data it just wrote?
I *am* going to use it shortly thereafter!

> > That's why we also have bona fide fadvise and madvise
> > with FADV_DONTNEED/MADV_DONTNEED:
> >
> > http://www.die.net/doc/linux/man/man2/fadvise.2.html
> > http://www.die.net/doc/linux/man/man2/madvise.2.html
> >
> > _This_ is the proper way to say "do not cache me".
>
> But none of those advisories says how to cache or not, only what the 
> expected behavior will be. So FADV_NOREUSE does not control cache use, 
> it simply allows the system to make assumptions.

Exactly. If you don't need the data, just let the kernel know that.
When you use O_DIRECT, you are saying "I want direct DMA to disk without
extra copying". With fadvise(FADV_DONTNEED) you are saying
"do not expect access in the near future" == "do not try to optimize
for possible accesses in near future" == "do not cache"!
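
As a rough sketch of the "do not cache" hint, the helper below calls
posix_fadvise() with POSIX_FADV_DONTNEED on a whole file. The name
drop_file_cache is made up for this example. The fsync() beforehand is an
assumption about typical usage: on Linux the hint does not discard dirty
pages, so they are written out first. The call is still only advice, and the
kernel may keep the pages anyway.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Tell the kernel that the cached pages of fd are not needed soon
 * (hypothetical helper). Returns 0 on success or an errno value. */
static int drop_file_cache(int fd)
{
    assert(fd >= 0);

    /* Write dirty pages out first, otherwise they cannot be dropped. */
    if (fsync(fd) != 0)
        return errno;

    /* offset 0 with len 0 means "from the start to the end of the file".
     * posix_fadvise() returns the error number directly, not -1. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

This works on an ordinary buffered fd, so it is independent of whether the
file was opened with O_DIRECT.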

Again: with O_DIRECT:

	write(fd, hugebuf, 100*1024*1024);

the kernel _has difficulty_ caching this data, simply because the
data isn't copied into kernel pages anyway, and if the user
continues to use hugebuf after write(), the kernel simply cannot
cache that data - it _doesn't have_ the data.

But what if the user unmaps hugebuf? What then? Should the kernel
forget that the data in these pages is in effect cached data from
the file being written to? Not necessarily.

Four years ago Linus wrote an email about it:

http://lkml.org/lkml/2002/5/11/58

btw, as an Oracle DBA in my day job, I completely agree
with Linus on the "deranged monkey" comparison in that mail...
--
vda
