From: Abhijith Das <adas@redhat.com>
To: cluster-devel@redhat.com
Subject: [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call
Date: Thu, 31 Jul 2014 22:11:05 -0400 (EDT)
Message-ID: <776040625.354492.1406859065504.JavaMail.zimbra@redhat.com>
In-Reply-To: <20140731235306.GR20518@dastard>
----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "Andreas Dilger" <adilger@dilger.ca>
> Cc: "Abhijith Das" <adas@redhat.com>, "LKML" <linux-kernel@vger.kernel.org>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>, cluster-devel@redhat.com
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> > >>> On July 26, 2014 12:27:19 AM, "Andreas Dilger" <adilger@dilger.ca> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch
> > >>>> entries in readdir() order? It isn't clear to me what benefit
> > >>>> there is of returning the entries to userspace instead of just
> > >>>> doing the statahead implicitly in the kernel?
> > >>>>
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential
> > >>>> access pattern for readdir() + stat() in readdir() order (taking
> > >>>> into account if ".*" entries are being processed or not) and
> > >>>> starts fetching the inode attributes asynchronously with a worker
> > >>>> thread.
> > >>>
> > >>> Does this heuristic work well in practice? In the use case we
> > >>> were trying to address, a Samba server is aware beforehand if it
> > >>> is going to stat all the inodes in a directory.
> > >>
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server. I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >>
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>>
> > >>> Were you thinking arbitrary inodes across the filesystem or just
> > >>> a subset from a directory? Arbitrary inodes may potentially throw
> > >>> up locking issues.
> > >>
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > >
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> >
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched. If the inode
> > number is bad the filesystem can just ignore it.
>
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
>
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
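>
> Schematically - assuming the usual agno/agbno/offset packing of XFS
> inode numbers, and with illustrative field names rather than the real
> macros - the trusted path is just:
>
>     /* ino came from a dirent the kernel itself just read: trust it. */
>     agno  = ino >> (inopblog + agblklog);               /* allocation group */
>     agbno = (ino >> inopblog) & ((1 << agblklog) - 1);  /* block within AG  */
>     daddr = (agno * agblocks + agbno) << blkbb_log;     /* disk address     */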
>
> i.e. an "inode number readahead" syscall costs at least 5 orders of
> magnitude more CPU than a "directory readahead" syscall, it has
> significant serialisation issues, and it can stall other
> modifications/lookups going on at the same time. That's *horrible
> behaviour* for a speculative readahead operation, but because the
> inode numbers are untrusted, we can't avoid it.
>
> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.
>
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the
> kernel if there's a compelling reason to do it in kernelspace, and
> right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
>
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
>
This was next on my to-do list: seeing how an all-userspace solution
compares with doing this in the kernel. I'll post some results as soon as
I can find some time to play with this stuff again.
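
Roughly, the userspace comparison I have in mind is just a small pool of
worker threads calling stat() on whatever readdir() returns - an untested
sketch along these lines (thread count and error handling are placeholders):

/*
 * Untested sketch: NTHREADS workers pull names off a shared DIR stream
 * and stat() them to warm the inode cache.
 * Build with: gcc -O2 -pthread dir_statahead.c
 */
#include <dirent.h>
#include <limits.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

#define NTHREADS 8

static const char *dirpath;
static DIR *dirp;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	char path[PATH_MAX];
	struct stat st;
	struct dirent *de;

	(void)arg;
	for (;;) {
		/* readdir() may reuse its buffer, so copy the name under the lock. */
		pthread_mutex_lock(&lock);
		de = readdir(dirp);
		if (de)
			snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		pthread_mutex_unlock(&lock);
		if (!de)
			break;
		stat(path, &st);	/* populates dcache/inode cache */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dir>\n", argv[0]);
		return 1;
	}
	dirpath = argv[1];
	dirp = opendir(dirpath);
	if (!dirp) {
		perror("opendir");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	closedir(dirp);
	return 0;
}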
Cheers!
--Abhi