cluster-devel.redhat.com archive mirror
 help / color / mirror / Atom feed
From: Abhijith Das <adas@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call
Date: Thu, 31 Jul 2014 22:21:14 -0400 (EDT)	[thread overview]
Message-ID: <801205433.377924.1406859674954.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <20140731131805.5280c697@notabene.brown>



----- Original Message -----
> From: "NeilBrown" <neilb@suse.de>
> To: "Abhi Das" <adas@redhat.com>
> Cc: linux-kernel at vger.kernel.org, linux-fsdevel at vger.kernel.org, cluster-devel at redhat.com
> Sent: Wednesday, July 30, 2014 10:18:05 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das <adas@redhat.com> wrote:
> 
> > This system call takes 3 arguments:
> > fd      - file descriptor of the directory being readahead
> > *offset - offset in dir from which to resume. This is updated
> >           as we move along in the directory
> > count   - The max number of entries to readahead
> > 
> > The syscall is supposed to read upto 'count' entries starting at
> > '*offset' and cache the inodes corresponding to those entries. It
> > returns a negative error code or a positive number indicating
> > the number of inodes it has issued readaheads for. It also
> > updates the '*offset' value so that repeated calls to dirreadahead
> > can resume at the right location. Returns 0 when there are no more
> > entries left.
> 
> Hi Abhi,
> 
>  I like the idea of enhanced read-ahead on a directory.
>  It isn't clear to me why you have included these particular fields in the
>  interface though.
> 
>  - why have an 'offset'?  Why not just use the current offset of the
>    directory 'fd'?

The idea was that we didn't want a syscall like readahead mucking with the
file pointer as the same fd might be used to do getdents().

>  - Why have a count?  How would a program choose what count to give?

If a program knows that it's only going to work on a subset of files at a time,
it can use the count value to only readahead a small number of inodes at once
instead of reading ahead the entire directory.

That said, this interface is not set in stone and we are exploring ways to inform
the kernel of the inodes we are interested in reading ahead.

> 
>  Maybe you imagine using 'getdents' first to get a list of names, then
>  selectively calling 'dirreadahead'  on the offsets of the names you are
>  interested it?  That would be racy as names can be added and removed which
>  might change offsets.  So maybe you have another reason?
> 
>  I would like to suggest an alternate interface (I love playing the API
>  game....).
> 
>  1/ Add a flag to 'fstatat'  AT_EXPECT_MORE.
>     If the pathname does not contain a '/', then the 'dirfd' is marked
>     to indicate that stat information for all names returned by getdents will
>     be wanted.  The filesystem can choose to optimise that however it sees
>     fit.
> 
>  2/ Add a flag to 'fstatat'  AT_NONBLOCK.
>     This tells the filesystem that you want this information, so if it can
>     return it immediately it should, and if not it should start pulling it
>     into cache.  Possibly this should be two flags: AT_NONBLOCK just avoids
>     any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.
> 
>  Then an "ls -l" could use AT_EXPECT_MORE and then just stat each name.
>  An "ls -l *.c", might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
>  against all names, then try again with all the names that returned
>  EWOULDBLOCK the first time.
> 
> 
>  I would really like to see the 'xstat' syscall too, but there is no point
>  having both "xstat" and "fxstat".  Follow the model of "fstatat" and provide
>  just "fxstatat" which can do both.
>  With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
>  would be wanted so it can fetch only that which is desired.
> 
>  I'm not very keen on the xgetdents idea of including name information and
>  stat information into the one syscall - I would prefer getdents and xstat be
>  kept separate.  Of course if a genuine performance cost of the separate can
>  be demonstrated, I could well change my mind.
> 
>  It does, however, have the advantage that the kernel doesn't need to worry
>  about how long read-ahead data needs to be kept, and the application doesn't
>  need to worry about how soon to retry an fstatat which failed with
>  EWOULDBLOCK.
> 
> Thanks for raising this issue again.  I hope it gets fixed one day...
> 
> NeilBrown
> 



      reply	other threads:[~2014-08-01  2:21 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-25 17:37 [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call Abhi Das
2014-07-25 17:37 ` [Cluster-devel] [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks Abhi Das
2014-07-29  8:21   ` Michael Kerrisk
2014-07-31  3:31     ` Dave Chinner
2014-07-25 17:37 ` [Cluster-devel] [RFC PATCH 2/2] gfs2: GFS2's implementation of the dir_readahead file operation Abhi Das
2014-07-26  5:27 ` [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call Andreas Dilger
2014-07-28 12:52   ` Abhijith Das
2014-07-28 21:19     ` Andreas Dilger
2014-07-29  9:36       ` Steven Whitehouse
2014-07-31  4:49       ` Dave Chinner
2014-07-31 11:19         ` Andreas Dilger
2014-07-31 23:53           ` Dave Chinner
2014-08-01  2:11             ` Abhijith Das
2014-08-01  5:54             ` Andreas Dilger
2014-08-06  2:01               ` Dave Chinner
2014-10-21  5:21             ` Abhijith Das
2014-11-10  3:41               ` Abhijith Das
2014-11-10 22:23                 ` Andreas Dilger
2014-11-10 22:47                   ` Abhijith Das
2014-07-29  8:19 ` Michael Kerrisk
2014-07-31  3:18 ` NeilBrown
2014-08-01  2:21   ` Abhijith Das [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=801205433.377924.1406859674954.JavaMail.zimbra@redhat.com \
    --to=adas@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).