From: Trond Myklebust <trondmy@primarydata.com>
To: "neilb@suse.com" <neilb@suse.com>,
"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
"jlayton@kernel.org" <jlayton@kernel.org>
Cc: "Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH/RFC] NFS: add nostatflush mount option.
Date: Fri, 5 Jan 2018 01:34:02 +0000 [thread overview]
Message-ID: <1515116033.87651.1.camel@primarydata.com> (raw)
In-Reply-To: <87d12tf99x.fsf@notabene.neil.brown.name>
[-- Attachment #1: Type: text/plain, Size: 10056 bytes --]
Hi Neil,
On Tue, 2018-01-02 at 10:29 +1100, NeilBrown wrote:
> On Sat, Dec 23 2017, Jeff Layton wrote:
>
> > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> > > On Thu, Dec 21 2017, Trond Myklebust wrote:
> > >
> > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > > > Hi Neil-
> > > > >
> > > > >
> > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com>
> > > > > > wrote:
> > > > > >
> > > > > >
> > > > > > When an i_op->getattr() call is made on an NFS file
> > > > > > (typically from a 'stat' family system call), NFS
> > > > > > will first flush any dirty data to the server.
> > > > > >
> > > > > > This ensures that the mtime reported is correct and stable,
> > > > > > but has a performance penalty. 'stat' is normally thought
> > > > > > to be a quick operation, and imposing this cost can be
> > > > > > surprising.
> > > > >
> > > > > To be clear, this behavior is a POSIX requirement.
> > > > >
> > > > >
> > > > > > I have seen problems when one process is writing a large
> > > > > > file and another process performs "ls -l" on the containing
> > > > > > directory and is blocked for as long as it take to flush
> > > > > > all the dirty data to the server, which can be minutes.
> > > > >
> > > > > Yes, a well-known annoyance that cannot be addressed
> > > > > even with a write delegation.
> > > > >
> > > > >
> > > > > > I have also seen a legacy application which frequently
> > > > > > calls
> > > > > > "fstat" on a file that it is writing to. On a local
> > > > > > filesystem (and in the Solaris implementation of NFS) this
> > > > > > fstat call is cheap. On Linux/NFS, the causes a noticeable
> > > > > > decrease in throughput.
> > > > >
> > > > > If the preceding write is small, Linux could be using
> > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > > >
> > > > >
> > > > > > The only circumstances where an application calling
> > > > > > 'stat()'
> > > > > > might get an mtime which is not stable are times when some
> > > > > > other process is writing to the file and the two processes
> > > > > > are not using locking to ensure consistency, or when the
> > > > > > one
> > > > > > process is both writing and stating. In neither of these
> > > > > > cases is it reasonable to expect the mtime to be stable.
> > > > >
> > > > > I'm not convinced this is a strong enough rationale
> > > > > for claiming it is safe to disable the existing
> > > > > behavior.
> > > > >
> > > > > You've explained cases where the new behavior is
> > > > > reasonable, but do you have any examples where the
> > > > > new behavior would be a problem? There must be a
> > > > > reason why POSIX explicitly requires an up-to-date
> > > > > mtime.
> > > > >
> > > > > What guidance would nfs(5) give on when it is safe
> > > > > to specify the new mount option?
> > > > >
> > > > >
> > > > > > In the most common cases where mtime is important
> > > > > > (e.g. make), no other process has the file open, so there
> > > > > > will be no dirty data and the mtime will be stable.
> > > > >
> > > > > Isn't it also the case that make is a multi-process
> > > > > workload where one process modifies a file, then
> > > > > closes it (which triggers a flush), and then another
> > > > > process stats the file? The new mount option does
> > > > > not change the behavior of close(2), does it?
> > > > >
> > > > >
> > > > > > Rather than unilaterally changing this behavior of 'stat',
> > > > > > this patch adds a "nosyncflush" mount option to allow
> > > > > > sysadmins to have applications which are hurt by the
> > > > > > current
> > > > > > behavior to disable it.
> > > > >
> > > > > IMO a mount option is at the wrong granularity. A
> > > > > mount point will be shared between applications that
> > > > > can tolerate the non-POSIX behavior and those that
> > > > > cannot, for instance.
> > > >
> > > > Agreed.
> > > >
> > > > The other thing to note here is that we now have an embryonic
> > > > statx()
> > > > system call, which allows the application itself to decide
> > > > whether or
> > > > not it needs up to date values for the atime/ctime/mtime. While
> > > > we
> > > > haven't yet plumbed in the NFS side, the intention was always
> > > > to use
> > > > that information to turn off the writeback flushing when
> > > > possible.
> > >
> > > Yes, if statx() were actually working, we could change the
> > > application
> > > to avoid the flush. But then if changing the application were an
> > > option, I suspect that - for my current customer issue - we could
> > > just
> > > remove the fstat() calls. I doubt they are really necessary.
> > > I think programmers often think of stat() (and particularly
> > > fstat()) as
> > > fairly cheap and so they use it whenever convenient. Only NFS
> > > violates
> > > this expectation.
> > >
> > > Also statx() is only a real solution if/when it gets widely
> > > used. Will
> > > "ls -l" default to AT_STATX_DONT_SYNC ??
> > >
> >
> > Maybe. Eventually, I could see glibc converting normal
> > stat/fstat/etc.
> > to use a statx() syscall under the hood (similar to how stat
> > syscalls on
> > 32-bit arches will use stat64 in most cases).
> >
> > With that, we could look at any number of ways to sneak a "don't
> > flush"
> > flag into the call. Maybe an environment variable that causes the
> > stat
> > syscall wrapper to add it? I think there are possibilities there
> > that
> > don't necessarily require recompiling applications.
>
> Thanks - interesting ideas.
>
> One possibility would be an LD_PRELOAD which implements fstat() using
> statx().
> That doesn't address the "ls -l is needlessly slow" problem, but it
> would address the "legacy application calls fstat too often" problem.
>
> This isn't an option for the "enterprise kernel" the customer is
> using
> (statx? what is statx?), but having a clear view of a credible
> upstream solution is very helpful.
>
> So thanks - and thanks a lot to Trond and Chuck for your input. It
> helped clarify my thoughts a lot.
>
> Is anyone working on proper statx support for NFS, or is it a case of
> "that shouldn't be hard and we should do that, but it isn't a high
> priority for anyone" ??
How about something like the following?
Cheers
Trond
8<--------------------------------------------------------
From 755b6771deb8d793c90f56fddf7070d7c2ea87b5 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@primarydata.com>
Date: Thu, 4 Jan 2018 17:46:09 -0500
Subject: [PATCH] Support statx() mask and query flags parameters
Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
only.
Use the mask to optimise away server revalidation for attributes
that are not being requested by the user.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
fs/nfs/inode.c | 40 ++++++++++++++++++++++++++++++++++------
1 file changed, 34 insertions(+), 6 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b112002dbdb2..a703b1d1500d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -735,12 +735,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
struct inode *inode = d_inode(path->dentry);
- int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
+ unsigned long cache_validity;
+ bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
+ bool dont_sync = !force_sync && query_flags & AT_STATX_DONT_SYNC;
+ bool need_atime = !dont_sync;
+ bool need_cmtime = !dont_sync;
+ bool reval = force_sync;
int err = 0;
+ if (!(request_mask & STATX_ATIME))
+ need_atime = false;
+ if (!(request_mask & (STATX_CTIME|STATX_MTIME)))
+ need_cmtime = false;
+
trace_nfs_getattr_enter(inode);
/* Flush out writes to the server in order to update c/mtime. */
- if (S_ISREG(inode->i_mode)) {
+ if (S_ISREG(inode->i_mode) && need_cmtime) {
err = filemap_write_and_wait(inode->i_mapping);
if (err)
goto out;
@@ -757,9 +767,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
*/
if ((path->mnt->mnt_flags & MNT_NOATIME) ||
((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
- need_atime = 0;
-
- if (need_atime || nfs_need_revalidate_inode(inode)) {
+ need_atime = false;
+
+ /* Check for whether the cached attributes are invalid */
+ cache_validity = READ_ONCE(NFS_I(inode)->cache_validity);
+ if (need_cmtime)
+ reval |= cache_validity & NFS_INO_REVAL_PAGECACHE;
+ if (need_atime)
+ reval |= cache_validity & NFS_INO_INVALID_ATIME;
+ if (request_mask & (STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|
+ STATX_ATIME|STATX_MTIME|STATX_CTIME|
+ STATX_SIZE|STATX_BLOCKS))
+ reval |= cache_validity & NFS_INO_INVALID_ATTR;
+ if (dont_sync)
+ reval = false;
+
+ if (reval) {
struct nfs_server *server = NFS_SERVER(inode);
if (!(server->flags & NFS_MOUNT_NOAC))
@@ -767,13 +790,18 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
else
nfs_readdirplus_parent_cache_hit(path->dentry);
err = __nfs_revalidate_inode(server, inode);
- } else
+ } else if (!dont_sync)
nfs_readdirplus_parent_cache_hit(path->dentry);
if (!err) {
generic_fillattr(inode, stat);
stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
if (S_ISDIR(inode->i_mode))
stat->blksize = NFS_SERVER(inode)->dtsize;
+ /* Return only the requested attrs if others may be stale */
+ if (!reval && cache_validity & (NFS_INO_REVAL_PAGECACHE|
+ NFS_INO_INVALID_ATIME|
+ NFS_INO_INVALID_ATTR))
+ stat->result_mask &= request_mask;
}
out:
trace_nfs_getattr_exit(inode, err);
--
2.14.3
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2018-01-05 1:34 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-21 2:57 [PATCH/RFC] NFS: add nostatflush mount option NeilBrown
2017-12-21 15:39 ` Chuck Lever
2017-12-21 15:39 ` Chuck Lever
2017-12-21 15:54 ` Trond Myklebust
2017-12-21 15:54 ` Trond Myklebust
2017-12-21 20:59 ` NeilBrown
2017-12-21 21:39 ` Trond Myklebust
2017-12-21 22:35 ` NeilBrown
2017-12-22 3:17 ` Trond Myklebust
2017-12-23 13:16 ` Jeff Layton
2018-01-01 23:29 ` NeilBrown
2018-01-05 1:34 ` Trond Myklebust [this message]
2017-12-21 20:51 ` NeilBrown
2017-12-22 16:38 ` Chuck Lever
2017-12-22 16:38 ` Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1515116033.87651.1.camel@primarydata.com \
--to=trondmy@primarydata.com \
--cc=Anna.Schumaker@netapp.com \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.