linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Phillip Susi <psusi@cfl.rr.com>
Cc: linux-fsdevel@vger.kernel.org,
	Linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: readahead on directories
Date: Wed, 21 Apr 2010 21:22:09 +0100	[thread overview]
Message-ID: <20100421202209.GV27575@shareable.org> (raw)
In-Reply-To: <4BCF3FAE.7090206@cfl.rr.com>

Phillip Susi wrote:
> On 4/21/2010 12:12 PM, Jamie Lokier wrote:
> > Asynchronous is available: Use clone or pthreads.
> 
> Synchronous in another process is not the same as async.  It seems I'm
> going to have to do this for now as a workaround, but one of the reasons
> that aio was created was to avoid the inefficiencies this introduces.
> Why create a new thread context, switch to it, put a request in the
> queue, then sleep, when you could just drop the request in the queue in
> the original thread and move on?

Because tests have found that it's sometimes faster than AIO anyway!

...for those things where AIO is supported at all.  The problem with
more complicated fs operations (like, say, buffered file reads and
directory operations) is you can't just put a request in a queue.

Some of it has to be done in a context with stack and occasional
sleeping.  It's just too complicated to make all filesystem operations
_entirely_ async, and that is the reason Linux AIO has never gotten
very far trying to do that.

Those things where putting a request on a queue works tend to move the
sleepable metadata fetching to the code _before_ the request is queued
to get around that.  Which is one reason why Linux O_DIRECT AIO can
still block when submitting a request... :-/

The most promising direction for AIO at the moment is in fact spawning
kernel threads on demand to do the work that needs a context, and
swizzling some pointers so that, to userspace, it doesn't look like
threads were used.

Kernel threads spawned on demand, especially magical demand at the point
where the thread would block, are faster than clone() in userspace - but
they're not expected to be much faster when you're reading from a cold
cache anyway, since lots of blocking happens either way.

You might even find that calling readahead() on *files* goes a bit
faster if you have several threads working in parallel calling it,
because of the ability to parallelise metadata I/O.

> > A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> > page_cache_sync_readahead.  Doesn't that do any reading ahead? :-)
> 
> Unfortunately it does not help when it is synchronous.  The process
> still sleeps until it has fetched the blocks it needs.  I believe that
> code just ends up doing a single 4kb read if the directory is no larger
> than that, or if it is, then it reads up to readahead_size.  It puts the
> request in the queue then sleeps until all the data has been read, even
> if only the first 4kb was required before readdir() could return.

So you're saying it _does_ readahead_size if needed.  That's great!
Evgeniy's concern about sequentially reading blocks one by one
isn't anything to care about then.  That's one problem solved. :-)

> This means that a single thread calling readdir() is still going to
> block reading the directory before it can move on to trying to read
> other directories that are also needed.

Of course.

> > If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
> > ext4_fiemap.  So you may have luck calling FIEMAP or FIBMAP on the
> > directory, and then reading blocks using the block device.  I'm not
> > sure if the cache loaded via the block device (when mounted) will then
> > be used for directory lookups.
> 
> Yes, I had considered that.  ureadahead already makes use of ext2fslibs
> to open the block device and read the inode tables so they are already
> in the cache for later use.  It seems a bit silly to do that though,
> when that is exactly what readahead() SHOULD do for you.

Don't bother with FIEMAP then.  It sounds like all the preloadable
metadata is already loaded.  FIEMAP would have still needed to be
threaded for parallel directories.

Filesystem-independent readahead() on directories is out of the
question (except by using a kernel background thread, which is
pointless because you can do that yourself).

Some filesystems have directories which aren't stored like a file's
data, and the process of reading the directory needs to work through
its logic, and needs a sleepable context to work in.  Generic page
reading won't work for all of them.

readahead() on directories in specific filesystem types may be possible.
It would have to be implemented in each fs.

-- Jamie


Thread overview: 34+ messages
2010-04-19 15:51 readahead on directories Phillip Susi
2010-04-21  0:44 ` Jamie Lokier
2010-04-21 14:57   ` Phillip Susi
2010-04-21 16:12     ` Jamie Lokier
2010-04-21 18:10       ` Phillip Susi
2010-04-21 20:22         ` Jamie Lokier [this message]
2010-04-21 20:59           ` Phillip Susi
2010-04-21 22:06             ` Jamie Lokier
2010-04-22  7:01               ` Brad Boyer
2010-04-22 14:26               ` Phillip Susi
2010-04-22 17:53                 ` Jamie Lokier
2010-04-22 19:23                   ` Phillip Susi
2010-04-22 20:35                     ` Jamie Lokier
2010-04-22 21:22                       ` Phillip Susi
2010-04-22 22:43                         ` Jamie Lokier
2010-04-23  4:13                           ` Phillip Susi
2010-04-21 18:38       ` Evgeniy Polyakov
2010-04-21 18:51         ` Jamie Lokier
2010-04-21 18:56           ` Evgeniy Polyakov
2010-04-21 20:02             ` Jamie Lokier
2010-04-21 20:21               ` Evgeniy Polyakov
2010-04-21 20:39                 ` Jamie Lokier
2010-04-21 19:23           ` Phillip Susi
2010-04-21 20:01             ` Jamie Lokier
2010-04-21 20:13               ` Phillip Susi
2010-04-21 20:37                 ` Jamie Lokier
2010-05-07 13:38 ` unified page and buffer cache? (was: readahead on directories) Phillip Susi
2010-05-07 13:53   ` Matthew Wilcox
2010-05-07 15:45     ` unified page and buffer cache? Phillip Susi
2010-05-07 18:30       ` Matthew Wilcox
2010-05-08  0:50         ` Phillip Susi
2010-05-08  0:46       ` tytso
2010-05-08  0:54         ` Phillip Susi
2010-05-08 12:52           ` tytso
