From: Jamie Lokier <jamie@shareable.org>
To: Phillip Susi <psusi@cfl.rr.com>
Cc: linux-fsdevel@vger.kernel.org,
Linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: readahead on directories
Date: Thu, 22 Apr 2010 23:43:27 +0100
Message-ID: <20100422224327.GE13951@shareable.org>
In-Reply-To: <4BD0BE20.4030908@cfl.rr.com>
Phillip Susi wrote:
> On 4/22/2010 4:35 PM, Jamie Lokier wrote:
> > POSIX requires concurrent, overlapping writes don't interleave the
> > data (at least, I have read that numerous times), which is usually
> > implemented with a mutex even though there are other ways.
>
> I think what you are getting at here is that write() needs to atomically
> update the file pointer, which does not need a mutex.
No, that is not the reason. pwrite needs the mutex too.
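
To illustrate why a positioned write still wants serialising, here is a
toy userspace sketch in Python. It is not kernel code: `FileSketch`, its
`lock`, and the chunked copy loop are all hypothetical stand-ins. There
is no file pointer anywhere, yet without the lock two overlapping writes
could interleave chunk by chunk.

```python
import threading


class FileSketch:
    """In-memory stand-in for a file with a per-file write lock (hypothetical)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.lock = threading.Lock()

    def pwrite(self, buf, offset, chunk=4096):
        # No shared file pointer is touched here. The lock is held for the
        # whole copy so that overlapping writers cannot interleave chunks.
        with self.lock:
            for i in range(0, len(buf), chunk):
                piece = buf[i:i + chunk]
                self.data[offset + i:offset + i + len(piece)] = piece
        return len(buf)
```

With the lock held across the copy, two writers targeting the same range
leave the range holding one writer's data in full, never a mixture.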
> > The trickier stuff in proper AIO is sleeping waiting for memory to be
> > freed up, sleeping waiting for a rate-limited request queue entry
> > repeatedly, prior to each of the triple, double, single indirect
> > blocks, which you then sleep waiting to complete, sleeping waiting for
> > an atime update journal node, sleeping on requests and I/O on every
>
> There's no reason to wait for updating the atime, and
> Whether it's reading indirect blocks or b-trees
> doesn't make much difference; the fs ->get_blocks() tries not to sleep
> if possible, and if it must, returns -EAGAIN and the calling code can
> punt to a work queue to try again in a context that can sleep.
Now you are describing using threads in the blocking cases. (Work
queues, thread pools, same thing.) Earlier you were saying threads
are the wrong approach.... Good, good :-)
> The fs specific code just needs to support a flag like gfp_mask so it
> can be told we aren't in a context that can sleep; do your best and if
> you must block, return -EAGAIN. It looks like it almost already does
> something like that based on this comment from fs/mpage.c:
Yes, it's not a bad pattern. Simple to understand.
There's a slight overhead compared with saving the stack frame
fibril-style: The second, sleepable call has to redo much of the work
done in the non-sleepable call, and queuing the work queue requires
serialising etc. plus extra code for that. Plus the work queue adds a
bit more scheduling.
On the other hand, the queue uses less memory than a stack frame.
For the in-cache cases, there's no overhead so it's fine.
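
A rough userspace sketch of the try-then-punt pattern, in Python rather
than kernel C. Everything named here is an assumption for illustration:
`page_cache` and `backing_store` are plain dicts, `get_block` is a
hypothetical lookup taking a `may_sleep` flag, and a thread pool stands
in for the work queue.

```python
import errno
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-ins: one dict as the cache, another as the slow disk.
page_cache = {}
backing_store = {"inode-7": b"directory block"}


def get_block(key, may_sleep):
    """Return cached data; fail with EAGAIN if it would block and may not."""
    if key in page_cache:
        return page_cache[key]
    if not may_sleep:
        raise BlockingIOError(errno.EAGAIN, "would block")
    data = backing_store[key]  # stands in for blocking I/O
    page_cache[key] = data
    return data


workqueue = ThreadPoolExecutor(max_workers=2)


def submit_read(key):
    """Try the non-sleeping path inline; on EAGAIN, punt to the work queue."""
    try:
        done = Future()
        done.set_result(get_block(key, may_sleep=False))
        return done
    except BlockingIOError:
        # The sleepable call starts over: the lookup above is redone.
        return workqueue.submit(get_block, key, True)
```

The cached case completes inline with no queueing at all, and the
uncached case pays for a second lookup plus the hand-off to a worker.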
A big problem with it, apart from having to change lots of places in
all the filesystems, is that the work-queues run with the wrong
security and I/O context. Network filesystems break permissions, quotas
break, ionice doesn't work, etc. It's obviously fixable but more
involved than just putting a read request on a work queue.
That's why the fibril/acall discussions talked about spawning threads
from the caller's context or otherwise magically swizzling contexts
around to do it with the efficiency of a preexisting thread pool.
Once you're doing task security & I/O context swizzling (which turns
out to be quite fiddly), the choice between swizzling stack frames or
using EAGAIN and work queue type objects becomes a less radical design
decision, and could even be a per-filesystem, per-operation choice.
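
To make the context problem concrete, a toy Python sketch (userspace,
not kernel code). `CallerContext`, `do_read` and both `punt_*` helpers
are hypothetical names, and the uid and priority values are invented.
The only point is that the queued item has to carry the submitter's
identity, otherwise the worker acts as itself.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerContext:
    uid: int
    io_priority: int


# Hypothetical identity the worker thread would otherwise run with.
WORKER_CONTEXT = CallerContext(uid=0, io_priority=4)

pool = ThreadPoolExecutor(max_workers=1)


def do_read(path, ctx):
    # A real implementation would check permissions, charge quotas and
    # apply I/O priority using ctx; here we only report what was used.
    return f"read {path} as uid={ctx.uid} ioprio={ctx.io_priority}"


def punt_naive(path):
    """Queue the read with no caller context: the worker's own is used."""
    return pool.submit(do_read, path, WORKER_CONTEXT)


def punt_with_context(path, caller):
    """Queue the read together with the submitter's context."""
    return pool.submit(do_read, path, caller)
```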
> > Oh, and fine-grained locking makes the async transformation harder,
> > not easier :-)
>
> How so? With fine grained locking you can avoid the use of mutexes and
> opt for atomic functions or spin locks, so no need to sleep.
Fine-grained locking isn't the same thing as using non-sleepable locks.
> > For readahead yes because it's just an abortable hint.
> > For general AIO, no.
>
> Why not? aio_read() is perfectly allowed to fail if there is not enough
> memory to satisfy the request.
So is read(). And then the calling application usually exits, because
there's nothing else it can do usefully. Same if aio_read() ever returns ENOMEM.
That way lies an application getting ENOMEM often and having to retry
aio_read in a loop, probably a busy one, which isn't how the interface
is supposed to work, and is not efficient either.
The only atomic allocation you might conceivably want is a small one
to enqueue the AIO and return immediately. But really even that
should sleep. That's the one case where you actually do want
aio_read() to sleep.
> That still leaves the problem of all the open() calls blocking to read
> one disk directory block at a time, since ureadahead opens all of the
> files first, then calls readahead() on each of them. This is where it
> would really help to be able to readahead() the directories first, then
> try to open all of the files.
Put open() in threads too! Actually I don't have any idea how well
that really goes.
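
For what it's worth, a minimal sketch of the idea in Python, assuming a
Linux-like system. `open_and_hint` and `prefetch_files` are hypothetical
helper names, the worker count is arbitrary, and `os.posix_fadvise` with
`POSIX_FADV_WILLNEED` stands in for readahead(). I have not measured
whether this helps.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def open_and_hint(path):
    """Open one file, ask the kernel to start reading it, return its size."""
    fd = os.open(path, os.O_RDONLY)  # may block on directory lookups
    try:
        size = os.fstat(fd).st_size
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)


def prefetch_files(paths, workers=8):
    """Run the blocking open() calls concurrently; return total bytes hinted."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(open_and_hint, paths))
```

Each blocking open() then overlaps with the others instead of running
one directory block at a time.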
> > Also, having defragged readahead files into a few compact zones, and
> > gotten the last boot's I/O trace, why not readahead those areas of the
> > blockdev first in perfect order, before finishing the job with
> > filesystem operations? The redundancy from no-longer needed blocks is
> > probably small compared with the gain from perfect order in few big
> > zones, and if you store the I/O trace of the filesystem stage every
> > time to use for the block stage next time, the redundancy should stay low.
>
> Good point, though I was hoping to be able to accomplish effectively the
> same thing purely with readahead() and other filesystem calls instead of
> going direct to the block device.
It depends on how accurate your block-level traces are, but if the
blocks are consolidated into few contiguous zones, readahead on the
blockdev should give perfect seek order, minimal IOPS and maximum I/O
sizes. It won't even need any particular order from the defrag. It's
hard to see even file readahead() approaching that for speed because
it's so simple.
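
A hedged sketch of the block stage in Python: sort the traced extents,
merge neighbours into a few large zones, then hint each zone on the
device. `merge_extents`, `readahead_zones` and the `max_gap` parameter
are hypothetical, extents are assumed to be (byte offset, length) pairs
from the previous boot's trace, and `POSIX_FADV_WILLNEED` on the block
device is assumed to be an acceptable way to trigger the readahead.

```python
import os


def merge_extents(extents, max_gap=0):
    """Sort (offset, length) extents and merge those within max_gap bytes."""
    merged = []
    for start, length in sorted(extents):
        if merged and start <= merged[-1][0] + merged[-1][1] + max_gap:
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, length))
    return merged


def readahead_zones(device_path, extents):
    """Hint each merged zone in ascending offset order."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        for start, length in merge_extents(extents):
            os.posix_fadvise(fd, start, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

Because the extents are sorted before merging, the order in which the
defrag or the trace produced them does not matter.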
Consider: Boot may read 50MB of data in countless files including
scripts, parts of shared libs etc. Read in one sequential pass, that
is just 0.5 seconds (about 100MB/s) on any modern system. How long
does the ureadahead run take?
-- Jamie