linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: bfields@fieldses.org (J. Bruce Fields)
To: Trond Myklebust <trondmy@primarydata.com>
Cc: Coddington Benjamin <bcodding@redhat.com>,
	List Linux NFS Mailing <linux-nfs@vger.kernel.org>
Subject: Re: I can't get no readdir satisfaction
Date: Wed, 24 Aug 2016 09:56:03 -0400	[thread overview]
Message-ID: <20160824135603.GF3938@fieldses.org> (raw)
In-Reply-To: <81D2E080-0697-4B26-BE38-5E8AC35C2EA4@primarydata.com>

On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>=20
> > On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@redhat.com> wr=
ote:
> >=20
> >=20
> >=20
> > On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
> >=20
> >>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> =
wrote:
> >>>=20
> >>> Hi linux-nfs,
> >>>=20
> >>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -=
l'
> >>> situations") changed when nfs_readdir() decides to revalidate the
> >>> directory's mapping, which contains all the entries.  In addition to =
just
> >>> checking if the attribute cache has expired, it includes a check to s=
ee if
> >>> NFS_INO_INVALID_DATA is set on the directory.
> >>>=20
> >>> Well, customers that have directories with very many dentries and tha=
t same
> >>> directory's attributes are frequently updated are now grouchy that `l=
s -l`
> >>> takes so long since any update of the directory causes the mapping to=
 be
> >>> invalidated and we have to start over filling the directory's mapping.
> >>>=20
> >>> I actually haven't put real hard thought into it yet (because often f=
or me
> >>> that just wastes a lot of time), so I am doing the lazy thing by aski=
ng this
> >>> question:
> >>>=20
> >>> Can we go back to just the using the attribute cache timeout, or shou=
ld we
> >>> get all heuristical about readdir?
> >>>=20
> >>=20
> >> We are all heuristical at this point. How are the heuristics failing?
> >>=20
> >> The original problem those heuristics were designed to solve was that =
all
> >> the stat() calls took forever to complete, since they are all synchron=
ous;
> >> Tigran showed some very convincing numbers for a large directory where=
 the
> >> difference in performance was an order of magnitude improved by using
> >> readdirplus instead of readdir=E2=80=A6
> >=20
> > I'll try to present a better explanation.  While `ls -l` is walking thr=
ough
> > a directory repeatedly entering nfs_readdir(), a CREATE response send us
> > through nfs_post_op_update_inode_locked():
> >=20
> > 1531     if (S_ISDIR(inode->i_mode))
> > 1532         invalid |=3D NFS_INO_INVALID_DATA;
> > 1533     nfs_set_cache_invalid(inode, invalid);
> >=20
> > Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping=
(),
> > which will do nfs_invalidate_mapping() for the directory, and so we hav=
e to
> > start over with cookie 0 sending READDIRPLUS to re-fill the directory's
> > mapping to get back to where we are for the current nfs_readdir().
> >=20
> > This process repeats for every entry into nfs_readdir() if the directory
> > keeps getting updated, and it becomes more likely that it will be updat=
ed as
> > each pass takes longer and longer to re-fill the mapping as the current
> > nfs_readdir() invocation is further along.
> >=20
> > READDIRPLUS isn't the problem, the problem is invalidating the directory
> > mapping in the middle of a series of getdents() if we do a CREATE.  Als=
o, I
> > think a similar thing happens if the directory's mtime or ctime is upda=
ted -
> > but in that case we set NFS_INO_INVALID_DATA because the change_attr
> > updates.
> >=20
> > So, for directories with a large number of entries that updates often, =
it can
> > be very slow to list the directory.
> >=20
> > Why did 311324ad1713 change nfs_readdir from
> >=20
> > if (nfs_attribute_cache_expired(inode))
> >    nfs_revalidate_mapping(inode, file->f_mapping);
> > to
> >=20
> > if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_IN=
O_INVALID_DATA)
> >    nfs_revalidate_mapping(inode, file->f_mapping);
>=20
> As the commit message says, the whole purpose was to use READDIRPLUS as a=
 substitute for multiple GETATTR calls when the heuristic tells us that the=
 user is performing an =E2=80=98ls -l=E2=80=99 style of workload.
>=20
> >=20
> > .. and can we go back to the way it was before?
>=20
> Not without slowing down =E2=80=98ls -l=E2=80=99 on large directories.
>=20
> >=20
> > OK.. I understand why -- it is more correct since if we know the direct=
ory has
> > changed, we might as well fetch the change.  Otherwise, we might be cre=
ating
> > files and then wondering why they aren't listed.
> >=20
> > It might be nicer to not invalidate the mapping we're currently using f=
or
> > readdir, though.  Maybe there's a way to keep the mapping for the curre=
ntly
> > opened directory and invalidate it once it's closed.
>=20

>  POSIX requires that you revalidate on opendir() and rewinddir(), and
>  leaves behaviour w.r.t. file addition and removal after the call to
>  opendir()/rewinddir() undefined
>  (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).

It is only undefined whether the added or removed entry is returned.
Other entries still need to returned exactly once.

In this case we're restarting the read of the directory from scratch--I
don't understand how that's possible while avoiding skipped or
duplicated entries.

Surely the only safe thing to do is to continue reading using the last
cookie returned from the server.

--b.

>  I believe most filesystems under Linux will ensure that if a file is
>  removed, it is no longer seen in the readdir stream, and a file that
>  is added will be seen unless the readdir cursor has already passed
>  it, so there is that to consider.

>=20
> However correctness was not the main issue here.
>=20
> N?????r??y????b?X??=C7=A7v?^?)=DE=BA{.n?+????{???"??^n?r???z?=1A??h?????&=
??=1E?G???h?=03(?=E9=9A=8E?=DD=A2j"??=1A?=1Bm??????z?=DE=96???f???h???~?m

  parent reply	other threads:[~2016-08-24 13:56 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-23 15:09 I can't get no readdir satisfaction Benjamin Coddington
2016-08-23 15:36 ` Trond Myklebust
2016-08-23 21:21   ` Benjamin Coddington
2016-08-24 12:18     ` Trond Myklebust
2016-08-24 13:15       ` Benjamin Coddington
2016-08-24 13:39         ` Trond Myklebust
2016-08-24 13:56       ` J. Bruce Fields [this message]
2016-08-24 14:02         ` Trond Myklebust
2016-08-24 14:16           ` Benjamin Coddington
2016-08-24 14:19             ` Trond Myklebust
2016-08-24 15:18               ` Benjamin Coddington
2016-08-24 14:20           ` Fields Bruce James
2016-08-24 14:26             ` Trond Myklebust
2016-08-24 14:40               ` J. Bruce Fields
2016-08-24 14:53                 ` Trond Myklebust
2016-08-24 15:16                   ` Fields Bruce James
2016-08-24 13:02     ` Benjamin Coddington
2016-08-23 19:36 ` J. Bruce Fields
2016-08-23 21:50   ` Benjamin Coddington

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160824135603.GF3938@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=bcodding@redhat.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@primarydata.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).