From: bfields@fieldses.org (J. Bruce Fields)
To: Trond Myklebust <trondmy@primarydata.com>
Cc: Coddington Benjamin <bcodding@redhat.com>,
List Linux NFS Mailing <linux-nfs@vger.kernel.org>
Subject: Re: I can't get no readdir satisfaction
Date: Wed, 24 Aug 2016 09:56:03 -0400 [thread overview]
Message-ID: <20160824135603.GF3938@fieldses.org> (raw)
In-Reply-To: <81D2E080-0697-4B26-BE38-5E8AC35C2EA4@primarydata.com>
On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>=20
> > On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@redhat.com> wr=
ote:
> >=20
> >=20
> >=20
> > On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
> >=20
> >>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> =
wrote:
> >>>=20
> >>> Hi linux-nfs,
> >>>=20
> >>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -=
l'
> >>> situations") changed when nfs_readdir() decides to revalidate the
> >>> directory's mapping, which contains all the entries. In addition to =
just
> >>> checking if the attribute cache has expired, it includes a check to s=
ee if
> >>> NFS_INO_INVALID_DATA is set on the directory.
> >>>=20
> >>> Well, customers that have directories with very many dentries and tha=
t same
> >>> directory's attributes are frequently updated are now grouchy that `l=
s -l`
> >>> takes so long since any update of the directory causes the mapping to=
be
> >>> invalidated and we have to start over filling the directory's mapping.
> >>>=20
> >>> I actually haven't put real hard thought into it yet (because often f=
or me
> >>> that just wastes a lot of time), so I am doing the lazy thing by aski=
ng this
> >>> question:
> >>>=20
> >>> Can we go back to just the using the attribute cache timeout, or shou=
ld we
> >>> get all heuristical about readdir?
> >>>=20
> >>=20
> >> We are all heuristical at this point. How are the heuristics failing?
> >>=20
> >> The original problem those heuristics were designed to solve was that =
all
> >> the stat() calls took forever to complete, since they are all synchron=
ous;
> >> Tigran showed some very convincing numbers for a large directory where=
the
> >> difference in performance was an order of magnitude improved by using
> >> readdirplus instead of readdir=E2=80=A6
> >=20
> > I'll try to present a better explanation. While `ls -l` is walking thr=
ough
> > a directory repeatedly entering nfs_readdir(), a CREATE response send us
> > through nfs_post_op_update_inode_locked():
> >=20
> > 1531 if (S_ISDIR(inode->i_mode))
> > 1532 invalid |=3D NFS_INO_INVALID_DATA;
> > 1533 nfs_set_cache_invalid(inode, invalid);
> >=20
> > Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping=
(),
> > which will do nfs_invalidate_mapping() for the directory, and so we hav=
e to
> > start over with cookie 0 sending READDIRPLUS to re-fill the directory's
> > mapping to get back to where we are for the current nfs_readdir().
> >=20
> > This process repeats for every entry into nfs_readdir() if the directory
> > keeps getting updated, and it becomes more likely that it will be updat=
ed as
> > each pass takes longer and longer to re-fill the mapping as the current
> > nfs_readdir() invocation is further along.
> >=20
> > READDIRPLUS isn't the problem, the problem is invalidating the directory
> > mapping in the middle of a series of getdents() if we do a CREATE. Als=
o, I
> > think a similar thing happens if the directory's mtime or ctime is upda=
ted -
> > but in that case we set NFS_INO_INVALID_DATA because the change_attr
> > updates.
> >=20
> > So, for directories with a large number of entries that updates often, =
it can
> > be very slow to list the directory.
> >=20
> > Why did 311324ad1713 change nfs_readdir from
> >=20
> > if (nfs_attribute_cache_expired(inode))
> > nfs_revalidate_mapping(inode, file->f_mapping);
> > to
> >=20
> > if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_IN=
O_INVALID_DATA)
> > nfs_revalidate_mapping(inode, file->f_mapping);
>=20
> As the commit message says, the whole purpose was to use READDIRPLUS as a=
substitute for multiple GETATTR calls when the heuristic tells us that the=
user is performing an =E2=80=98ls -l=E2=80=99 style of workload.
>=20
> >=20
> > .. and can we go back to the way it was before?
>=20
> Not without slowing down =E2=80=98ls -l=E2=80=99 on large directories.
>=20
> >=20
> > OK.. I understand why -- it is more correct since if we know the direct=
ory has
> > changed, we might as well fetch the change. Otherwise, we might be cre=
ating
> > files and then wondering why they aren't listed.
> >=20
> > It might be nicer to not invalidate the mapping we're currently using f=
or
> > readdir, though. Maybe there's a way to keep the mapping for the curre=
ntly
> > opened directory and invalidate it once it's closed.
>=20
> POSIX requires that you revalidate on opendir() and rewinddir(), and
> leaves behaviour w.r.t. file addition and removal after the call to
> opendir()/rewinddir() undefined
> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
It is only undefined whether the added or removed entry is returned.
Other entries still need to returned exactly once.
In this case we're restarting the read of the directory from scratch--I
don't understand how that's possible while avoiding skipped or
duplicated entries.
Surely the only safe thing to do is to continue reading using the last
cookie returned from the server.
--b.
> I believe most filesystems under Linux will ensure that if a file is
> removed, it is no longer seen in the readdir stream, and a file that
> is added will be seen unless the readdir cursor has already passed
> it, so there is that to consider.
>=20
> However correctness was not the main issue here.
>=20
> N?????r??y????b?X??=C7=A7v?^?)=DE=BA{.n?+????{???"??^n?r???z?=1A??h?????&=
??=1E?G???h?=03(?=E9=9A=8E?=DD=A2j"??=1A?=1Bm??????z?=DE=96???f???h???~?m
next prev parent reply other threads:[~2016-08-24 13:56 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-23 15:09 I can't get no readdir satisfaction Benjamin Coddington
2016-08-23 15:36 ` Trond Myklebust
2016-08-23 21:21 ` Benjamin Coddington
2016-08-24 12:18 ` Trond Myklebust
2016-08-24 13:15 ` Benjamin Coddington
2016-08-24 13:39 ` Trond Myklebust
2016-08-24 13:56 ` J. Bruce Fields [this message]
2016-08-24 14:02 ` Trond Myklebust
2016-08-24 14:16 ` Benjamin Coddington
2016-08-24 14:19 ` Trond Myklebust
2016-08-24 15:18 ` Benjamin Coddington
2016-08-24 14:20 ` Fields Bruce James
2016-08-24 14:26 ` Trond Myklebust
2016-08-24 14:40 ` J. Bruce Fields
2016-08-24 14:53 ` Trond Myklebust
2016-08-24 15:16 ` Fields Bruce James
2016-08-24 13:02 ` Benjamin Coddington
2016-08-23 19:36 ` J. Bruce Fields
2016-08-23 21:50 ` Benjamin Coddington
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160824135603.GF3938@fieldses.org \
--to=bfields@fieldses.org \
--cc=bcodding@redhat.com \
--cc=linux-nfs@vger.kernel.org \
--cc=trondmy@primarydata.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).