linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Dilger <adilger@clusterfs.com>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Sage Weil <sage@newdream.net>,
	Christoph Hellwig <hch@infradead.org>,
	Brad Boyer <flar@allandria.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	Gary Grider <ggrider@lanl.gov>,
	linux-fsdevel@vger.kernel.org
Subject: Re: readdirplus() as possible POSIX I/O API
Date: Wed, 6 Dec 2006 03:28:29 -0700	[thread overview]
Message-ID: <20061206102829.GY5937@schatzie.adilger.int> (raw)
In-Reply-To: <1165332224.5742.58.camel@lade.trondhjem.org>

On Dec 05, 2006  10:23 -0500, Trond Myklebust wrote:
> On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > I think the "barrier semantics" are something that have just crept
> > into this discussion and is confusing the issue.
> 
> It is the _only_ concept that is of interest for something like NFS or
> CIFS. We already have the ability to cache the information.

Actually, wouldn't the ability for readdirplus() (with valid flag) be
useful for NFS if only to indicate that it does not need to flush the
cache because the ctime/mtime isn't needed by the caller?

> 'find' should be quite happy with the existing readdir(). It does not
> need to use stat() or readdirplus() in order to recurse because
> readdir() provides d_type.

It does in any but the most simplistic invocations, like "find -mtime"
or "find -mode" or "find -uid", etc.

> If the application is able to select a statlite()-type of behaviour with
> the fadvise() hints, your filesystem could be told to serve up cached
> information instead of regrabbing locks. In fact that is a much more
> flexible scheme, since it also allows the filesystem to background the
> actual inode lookups, or to defer them altogether if that is more
> efficient.

I guess I just don't understand how fadvise() on a directory file handle
(used for readdir()) can be used to affect later stat operations (which
definitely will NOT be using that file handle)?  If you mean that the
application should actually open() each file, fadvise(), fstat(), close(),
instead of just a stat() call then we are WAY into negative improvements
here due to overhead of doing open+close.

> > The filesystem can't always do "stat-ahead" on the files because that
> > requires instantiating an inode on the client which may be stale (lock
> > revoked) by the time the app gets to it, and the app (and the VFS)  have
> > no idea just how stale it is, and whether the stat is a "real" stat or
> > "only" the readdir stat (because the fadvise would only be useful on
> > the directory, and not all of the child entries), so it would need to
> > re-stat the file.
> 
> Then provide hints that allow the app to select which behaviour it
> prefers. Most (all?) apps don't _care_, and so would be quite happy with
> cached information. That is why the current NFS caching model exists in
> the first place.

Most clustered filesystems have strong cache semantics, so that isn't
a problem.  IMHO, the mechanism to pass the hint to the filesystem IS
the readdirplus_lite() that tells the filesystem exactly which data is
needed on each directory entry.

> > Also, this would potentially blow the client's real
> > working set of inodes out of cache.
> 
> Why?

Because in many cases it is desirable to limit the number of DLM locks
on a given client (e.g. GFS2 thread with AKPM about clients with
millions of DLM locks due to lack of memory pressure on large mem systems).
That means a finite-size lock LRU on the client that risks being wiped
out by a few thousand files in a directory doing "readdir() + 5000*stat()".


Consider a system like BlueGene/L with 128k compute cores.  Jobs that
run on that system will periodically (e.g. every hour) create up to 128K
checkpoint+restart files to avoid losing a lot of computation if a node
crashes.  Even if each one of the checkpoints is in a separate directory
(I wish all users were so nice :-) it means 128K inodes+DLM locks for doing
an "ls" in the directory.

> > Doing things en-masse with readdirplus() also allows the filesystem to
> > do the stat() operations in parallel internally (which is a net win if
> > there are many servers involved) instead of serially as the application
> > would do.
> 
> If your application really cared, it could add threading to 'ls' to
> achieve the same result. You can also have the filesystem preload that
> information based on fadvise hints.

But it would still need 128K RPCs to get that information, and 128K new
inodes on that client.  And what is the chance that I can get a
multi-threading "ls" into the upstream GNU ls code?  In the case of local
filesystems multi-threading ls would be a net loss due to seeking.

But even for local filesystems readdirplus_lite() would allow them to
fill in stat information they already have (either in cache or on disk),
and may avoid doing extra work that isn't needed.  For filesystems that
don't care, readdirplus_lite() can just be readdir()+stat() internally.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


  reply	other threads:[~2006-12-06 10:28 UTC|newest]

Thread overview: 124+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-11-28  4:34 NFSv4/pNFS possible POSIX I/O API standards Gary Grider
2006-11-28  5:54 ` Christoph Hellwig
2006-11-28 10:54   ` Andreas Dilger
2006-11-28 11:28     ` Anton Altaparmakov
2006-11-28 20:17     ` Russell Cattelan
2006-11-28 23:28     ` Wendy Cheng
2006-11-29  9:12       ` Christoph Hellwig
2006-11-29  9:04   ` Christoph Hellwig
2006-11-29  9:14     ` Christoph Hellwig
2006-11-29  9:48     ` Andreas Dilger
2006-11-29 10:18       ` Anton Altaparmakov
2006-11-29  8:26         ` Brad Boyer
2006-11-30  9:25           ` Christoph Hellwig
2006-11-30 17:49             ` Sage Weil
2006-12-01  5:26               ` Trond Myklebust
2006-12-01  7:08                 ` Sage Weil
2006-12-01 14:41                   ` Trond Myklebust
2006-12-01 16:47                     ` Sage Weil
2006-12-01 18:07                       ` Trond Myklebust
2006-12-01 18:42                         ` Sage Weil
2006-12-01 19:13                           ` Trond Myklebust
2006-12-01 20:32                             ` Sage Weil
2006-12-04 18:02                           ` Peter Staubach
2006-12-05 23:20                             ` readdirplus() as possible POSIX I/O API Sage Weil
2006-12-06 15:48                               ` Peter Staubach
2006-12-03  1:57                         ` NFSv4/pNFS possible POSIX I/O API standards Andreas Dilger
2006-12-03  7:34                           ` Kari Hurtta
2006-12-03  1:52                     ` Andreas Dilger
2006-12-03 16:10                       ` Sage Weil
2006-12-04  7:32                         ` Andreas Dilger
2006-12-04 15:15                           ` Trond Myklebust
2006-12-05  0:59                             ` Rob Ross
2006-12-05  4:44                               ` Gary Grider
2006-12-05 10:05                                 ` Christoph Hellwig
2006-12-05  5:56                               ` Trond Myklebust
2006-12-05 10:07                                 ` Christoph Hellwig
2006-12-05 14:20                                   ` Matthew Wilcox
2006-12-06 15:04                                     ` Rob Ross
2006-12-06 15:44                                       ` Matthew Wilcox
2006-12-06 16:15                                         ` Rob Ross
2006-12-05 14:55                                   ` Trond Myklebust
2006-12-05 22:11                                     ` Rob Ross
2006-12-05 23:24                                       ` Trond Myklebust
2006-12-06 16:42                                         ` Rob Ross
2006-12-06 12:22                                     ` Ragnar Kjørstad
2006-12-06 15:14                                       ` Trond Myklebust
2006-12-05 16:55                                   ` Latchesar Ionkov
2006-12-05 22:12                                     ` Christoph Hellwig
2006-12-06 23:12                                       ` Latchesar Ionkov
2006-12-06 23:33                                         ` Trond Myklebust
2006-12-05 21:50                                   ` Rob Ross
2006-12-05 22:05                                     ` Christoph Hellwig
2006-12-05 23:18                                       ` Sage Weil
2006-12-05 23:55                                       ` Ulrich Drepper
2006-12-06 10:06                                         ` Andreas Dilger
2006-12-06 17:19                                           ` Ulrich Drepper
2006-12-06 17:27                                             ` Rob Ross
2006-12-06 17:42                                               ` Ulrich Drepper
2006-12-06 18:01                                                 ` Ragnar Kjørstad
2006-12-06 18:13                                                   ` Ulrich Drepper
2006-12-17 14:41                                                     ` Ragnar Kjørstad
2006-12-17 19:07                                                       ` Ulrich Drepper
2006-12-17 19:38                                                         ` Matthew Wilcox
2006-12-17 21:51                                                           ` Ulrich Drepper
2006-12-18  2:57                                                             ` Ragnar Kjørstad
2006-12-18  3:54                                                               ` Gary Grider
2006-12-07  5:57                                                 ` Andreas Dilger
2006-12-15 22:37                                                   ` Ulrich Drepper
2006-12-16 18:13                                                     ` Andreas Dilger
2006-12-16 19:08                                                       ` Ulrich Drepper
2006-12-14 23:58                                         ` statlite() Rob Ross
2006-12-07 23:39                                       ` NFSv4/pNFS possible POSIX I/O API standards Nikita Danilov
2006-12-05 14:37                               ` Peter Staubach
2006-12-05 10:26                             ` readdirplus() as possible POSIX I/O API Andreas Dilger
2006-12-05 15:23                               ` Trond Myklebust
2006-12-06 10:28                                 ` Andreas Dilger [this message]
2006-12-06 15:10                                   ` Trond Myklebust
2006-12-05 17:06                               ` Latchesar Ionkov
2006-12-05 22:48                                 ` Rob Ross
2006-11-29 10:25       ` NFSv4/pNFS possible POSIX I/O API standards Steven Whitehouse
2006-11-30 12:29         ` Christoph Hellwig
2006-12-01 15:52       ` Ric Wheeler
2006-11-29 12:23     ` Matthew Wilcox
2006-11-29 12:35       ` Matthew Wilcox
2006-11-29 16:26         ` Gary Grider
2006-11-29 17:18           ` Christoph Hellwig
2006-11-29 12:39       ` Christoph Hellwig
2006-12-01 22:29         ` Rob Ross
2006-12-02  2:35           ` Latchesar Ionkov
2006-12-05  0:37             ` Rob Ross
2006-12-05 10:02               ` Christoph Hellwig
2006-12-05 16:47               ` Latchesar Ionkov
2006-12-05 17:01                 ` Matthew Wilcox
     [not found]                   ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com>
2006-12-05 17:10                     ` Latchesar Ionkov
2006-12-05 17:39                     ` Matthew Wilcox
2006-12-05 21:55                       ` Rob Ross
2006-12-05 21:50                   ` Peter Staubach
2006-12-05 21:44                 ` Rob Ross
2006-12-06 11:01                   ` openg Christoph Hellwig
2006-12-06 15:41                     ` openg Trond Myklebust
2006-12-06 15:42                     ` openg Rob Ross
2006-12-06 23:32                       ` openg Christoph Hellwig
2006-12-14 23:36                         ` openg Rob Ross
2006-12-06 23:25                   ` Re: NFSv4/pNFS possible POSIX I/O API standards Latchesar Ionkov
2006-12-06  9:48                 ` David Chinner
2006-12-06 15:53                   ` openg and path_to_handle Rob Ross
2006-12-06 16:04                     ` Matthew Wilcox
2006-12-06 16:20                       ` Rob Ross
2006-12-06 20:57                         ` David Chinner
2006-12-06 20:40                     ` David Chinner
2006-12-06 20:50                       ` Matthew Wilcox
2006-12-06 21:09                         ` David Chinner
2006-12-06 22:09                         ` Andreas Dilger
2006-12-06 22:17                           ` Matthew Wilcox
2006-12-06 22:41                             ` Andreas Dilger
2006-12-06 23:39                           ` Christoph Hellwig
2006-12-14 22:52                             ` Rob Ross
2006-12-06 20:50                       ` Rob Ross
2006-12-06 21:01                         ` David Chinner
2006-12-06 23:19                     ` Latchesar Ionkov
2006-12-14 21:00                       ` Rob Ross
2006-12-14 21:20                         ` Matthew Wilcox
2006-12-14 23:02                           ` Rob Ross
2006-11-28 15:08 ` NFSv4/pNFS possible POSIX I/O API standards Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20061206102829.GY5937@schatzie.adilger.int \
    --to=adilger@clusterfs.com \
    --cc=aia21@cam.ac.uk \
    --cc=flar@allandria.com \
    --cc=ggrider@lanl.gov \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=sage@newdream.net \
    --cc=trond.myklebust@fys.uio.no \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).