linux-nfs.vger.kernel.org archive mirror
* Concurrent `ls` takes out the thrash
@ 2016-12-07 13:28 Benjamin Coddington
  2016-12-07 13:37 ` [PATCH] NFS: Serialize nfs_readdir() Benjamin Coddington
  2016-12-07 15:46 ` Concurrent `ls` takes out the thrash Trond Myklebust
  0 siblings, 2 replies; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 13:28 UTC (permalink / raw)
  To: linux-nfs list

I was asked to figure out why the listing of very large directories was
slow.  More specifically, why concurrently listing the same large directory
is /very/ slow.  It seems that sometimes a user's reaction to waiting for
'ls' to complete is to start a few more.. and then their machine takes a
very long time to complete that work.

I can reproduce that finding.  As an example:

time ls -fl /dir/with/200000/entries/ >/dev/null

real    0m10.766s
user    0m0.716s
sys     0m0.827s

But..

for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done

Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.

The problem is that concurrent 'ls' commands stack up in nfs_readdir(), both
waiting on the next page and taking turns filling the next page with XDR,
but only one of them will have desc->plus set, because setting it clears the
flag on the directory.  So if a page is filled by a process that doesn't
have desc->plus, then the next pass through lookup() dumps the entire
pagecache with nfs_force_use_readdirplus().  Then the next readdir starts
all over filling the pagecache.  Forward progress happens, but only after
many steps back re-filling the pagecache.

To me the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that with
nfsi->rwsem.  With that, the above parallel 'ls' takes 12 seconds for each
'ls' to complete.

This only works because with concurrent 'ls' there is a consistent buffer
size, so a waiting nfs_readdir() started in the same place for an unmodified
directory should always hit the cache after waiting.  Serializing
nfs_readdir() will not solve this problem for concurrent callers with
differing buffer sizes, or starting at different offsets, since there's a
good chance the waiting readdir() will not see the readdirplus flag when it
resumes and so will not prime the dcache.

While I think it's an OK fix, it feels bad to serialize.  At the same time,
nfs_readdir() is already serialized on the pagecache when concurrent callers
need to go to the server.  There might be other problems I haven't thought
about.

Maybe there's another way to fix this, or maybe we can just say "Don't do ls
more than once, you impatient bastards!"

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] NFS: Serialize nfs_readdir()
  2016-12-07 13:28 Concurrent `ls` takes out the thrash Benjamin Coddington
@ 2016-12-07 13:37 ` Benjamin Coddington
  2016-12-07 16:30   ` Christoph Hellwig
  2016-12-07 17:01   ` kbuild test robot
  2016-12-07 15:46 ` Concurrent `ls` takes out the thrash Trond Myklebust
  1 sibling, 2 replies; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 13:37 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
concurrent users of nfs_readdir(), and can cause the pagecache to
repeatedly be invalidated unnecessarily for this case.  Fix this by
serializing nfs_readdir() on the directory's rwsem.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 53e6d6a4bc9c..8321252a4c8d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -900,6 +900,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry	*dentry = file_dentry(file);
 	struct inode	*inode = d_inode(dentry);
+	struct nfs_inode *nfsi = NFS_I(inode);
 	nfs_readdir_descriptor_t my_desc,
 			*desc = &my_desc;
 	struct nfs_open_dir_context *dir_ctx = file->private_data;
@@ -917,6 +918,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	 */
 	memset(desc, 0, sizeof(*desc));
 
+	down_write(&nfsi->rwsem);
 	desc->file = file;
 	desc->ctx = ctx;
 	desc->dir_cookie = &dir_ctx->dir_cookie;
@@ -958,6 +960,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 			break;
 	} while (!desc->eof);
 out:
+	up_write(&nfsi->rwsem);
 	if (res > 0)
 		res = 0;
 	dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 13:28 Concurrent `ls` takes out the thrash Benjamin Coddington
  2016-12-07 13:37 ` [PATCH] NFS: Serialize nfs_readdir() Benjamin Coddington
@ 2016-12-07 15:46 ` Trond Myklebust
  2016-12-07 19:46   ` Benjamin Coddington
  1 sibling, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2016-12-07 15:46 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Linux NFS Mailing List

> On Dec 7, 2016, at 08:28, Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> I was asked to figure out why the listing of very large directories was
> slow.  More specifically, why concurrently listing the same large directory
> is /very/ slow.  It seems that sometimes a user's reaction to waiting for
> 'ls' to complete is to start a few more.. and then their machine takes a
> very long time to complete that work.
> 
> I can reproduce that finding.  As an example:
> 
> time ls -fl /dir/with/200000/entries/ >/dev/null
> 
> real    0m10.766s
> user    0m0.716s
> sys     0m0.827s
> 
> But..
> 
> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
> 
> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
> 
> The problem is that concurrent 'ls' commands stack up in nfs_readdir() both
> waiting on the next page and taking turns filling the next page with xdr,
> but only one of them will have desc->plus set because setting it clears the
> flag on the directory.  So if a page is filled by a process that doesn't have
> desc->plus then the next pass through lookup(), it dumps the entire page
> cache with nfs_force_use_readdirplus().  Then the next readdir starts all
> over filling the pagecache.  Forward progress happens, but only after many
> steps back re-filling the pagecache.

Yes, the readdir code was written well before Al's patches to parallelise
the VFS operations, and a lot of it did rely on the inode->i_mutex being
set on the directory by the VFS layer.

How about the following suggestion: instead of setting a flag on the inode,
we iterate through the entries in &nfsi->open_files, and set a flag on the
struct nfs_open_dir_context that the readdir processes can copy into
desc->plus. Does that help with your workload?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] NFS: Serialize nfs_readdir()
  2016-12-07 13:37 ` [PATCH] NFS: Serialize nfs_readdir() Benjamin Coddington
@ 2016-12-07 16:30   ` Christoph Hellwig
  2016-12-07 19:40     ` Benjamin Coddington
  2016-12-07 17:01   ` kbuild test robot
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-12-07 16:30 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs

On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
> Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
> concurrent users of nfs_readdir(), and can cause the pagecache to
> repeatedly be invalidated unnecessarily for this case.  Fix this by
> serializing nfs_readdir() on the directory's rwsem.

Just implement .iterate instead of .iterate_shared and you'll get the
serialization for free.  While doing that please add a comment on why
it's serialized.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] NFS: Serialize nfs_readdir()
  2016-12-07 13:37 ` [PATCH] NFS: Serialize nfs_readdir() Benjamin Coddington
  2016-12-07 16:30   ` Christoph Hellwig
@ 2016-12-07 17:01   ` kbuild test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2016-12-07 17:01 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: kbuild-all, Trond Myklebust, Anna Schumaker, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

Hi Benjamin,

[auto build test ERROR on nfs/linux-next]
[also build test ERROR on v4.9-rc8 next-20161207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Benjamin-Coddington/NFS-Serialize-nfs_readdir/20161208-001039
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   fs/nfs/dir.c: In function 'nfs_readdir':
>> fs/nfs/dir.c:921:18: error: 'struct nfs_inode' has no member named 'rwsem'
     down_write(&nfsi->rwsem);
                     ^
   fs/nfs/dir.c:963:16: error: 'struct nfs_inode' has no member named 'rwsem'
     up_write(&nfsi->rwsem);
                   ^

vim +921 fs/nfs/dir.c

   915		 * *desc->dir_cookie has the cookie for the next entry. We have
   916		 * to either find the entry with the appropriate number or
   917		 * revalidate the cookie.
   918		 */
   919		memset(desc, 0, sizeof(*desc));
   920	
 > 921		down_write(&nfsi->rwsem);
   922		desc->file = file;
   923		desc->ctx = ctx;
   924		desc->dir_cookie = &dir_ctx->dir_cookie;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 10104 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] NFS: Serialize nfs_readdir()
  2016-12-07 16:30   ` Christoph Hellwig
@ 2016-12-07 19:40     ` Benjamin Coddington
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 19:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs

On 7 Dec 2016, at 11:30, Christoph Hellwig wrote:

> On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
>> Optimizations to NFS to choose between READDIR and READDIRPLUS don't
>> expect concurrent users of nfs_readdir(), and can cause the pagecache to
>> repeatedly be invalidated unnecessarily for this case.  Fix this by
>> serializing nfs_readdir() on the directory's rwsem.
>
> Just implement .iterate instead of .iterate_shared and you'll get the
> serialization for free.  While doing that please add a comment on why
> it's serialized.

I had it in my head that Al was trying to get rid of .iterate, otherwise
yes, this would be more sensible.  I think it'll still benefit from
.iterate_shared in the regular case (the case where we're not trying to use
READDIRPLUS), so I'll try out Trond's suggestion to fix it up before chasing
this down.

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 15:46 ` Concurrent `ls` takes out the thrash Trond Myklebust
@ 2016-12-07 19:46   ` Benjamin Coddington
  2016-12-07 22:34     ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 19:46 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



On 7 Dec 2016, at 10:46, Trond Myklebust wrote:

>> On Dec 7, 2016, at 08:28, Benjamin Coddington <bcodding@redhat.com> wrote:
>>
>> I was asked to figure out why the listing of very large directories was
>> slow.  More specifically, why concurrently listing the same large directory
>> is /very/ slow.  It seems that sometimes a user's reaction to waiting for
>> 'ls' to complete is to start a few more.. and then their machine takes a
>> very long time to complete that work.
>>
>> I can reproduce that finding.  As an example:
>>
>> time ls -fl /dir/with/200000/entries/ >/dev/null
>>
>> real    0m10.766s
>> user    0m0.716s
>> sys     0m0.827s
>>
>> But..
>>
>> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
>>
>> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
>>
>> The problem is that concurrent 'ls' commands stack up in nfs_readdir() both
>> waiting on the next page and taking turns filling the next page with xdr,
>> but only one of them will have desc->plus set because setting it clears the
>> flag on the directory.  So if a page is filled by a process that doesn't have
>> desc->plus then the next pass through lookup(), it dumps the entire page
>> cache with nfs_force_use_readdirplus().  Then the next readdir starts all
>> over filling the pagecache.  Forward progress happens, but only after many
>> steps back re-filling the pagecache.
>
> Yes, the readdir code was written well before Al's patches to parallelise
> the VFS operations, and a lot of it did rely on the inode->i_mutex being
> set on the directory by the VFS layer.
>
> How about the following suggestion: instead of setting a flag on the
> inode, we iterate through the entries in &nfsi->open_files, and set a flag
> on the struct nfs_open_dir_context that the readdir processes can copy
> into desc->plus. Does that help with your workload?

That should work.. I guess I'll hack it up and present it for dissection.

Thanks!
Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 19:46   ` Benjamin Coddington
@ 2016-12-07 22:34     ` Benjamin Coddington
  2016-12-07 22:41       ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 22:34 UTC (permalink / raw)
  To: Trond Myklebust, Trond Myklebust; +Cc: linux-nfs

From: "Benjamin Coddington" <bcodding@redhat.com>

So, I've got the following patch, and it seems to work fine for the original
workload.  However, if I use ~20 procs started 2 seconds apart, I can still
sometimes get into the state where the machine takes a very long time (5 - 8
minutes).  I wonder if I am getting into a state where vmscan is reclaiming
pages while I'm trying to fill them up.  So.. I'll do a bit more debugging
and re-post this if I feel like it is still the right approach.

Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
could have added an unsigned long flags and used bits for the duped stuff as
well, but it was uglier that way, so I left it.  I like how duped carries
-1, 0, and >1 information without having flag defines cluttering everywhere.

Ben

8<------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 22:34     ` Benjamin Coddington
@ 2016-12-07 22:41       ` Trond Myklebust
  2016-12-07 22:55         ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2016-12-07 22:41 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Linux NFS Mailing List


> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> From: "Benjamin Coddington" <bcodding@redhat.com>
> 
> So, I've got the following patch, and it seems to work fine for the original
> workload.  However, if I use ~20 procs started 2 seconds apart, I can still
> sometimes get into the state where the machine takes a very long time (5 - 8
> minutes).  I wonder if I am getting into a state where vmscan is reclaiming
> pages while I'm trying to fill them up.  So.. I'll do a bit more debugging
> and re-post this if I feel like it is still the right approach.
> 
> Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
> could have added an unsigned long flags and used bits for the duped stuff as
> well, but it was uglier that way, so I left it.  I like how duped carries
> -1, 0, and >1 information without having flag defines cluttering everywhere.
> 
> Ben
> 
> 8<------------------------------------------------------------------------
> 
> From 5b3e747a984ec966e8c742d8f4fe4b08e1c93acd Mon Sep 17 00:00:00 2001
> Message-Id: <5b3e747a984ec966e8c742d8f4fe4b08e1c93acd.1481149380.git.bcodding@redhat.com>
> From: Benjamin Coddington <bcodding@redhat.com>
> Date: Wed, 7 Dec 2016 16:23:30 -0500
> Subject: [PATCH] NFS: Move readdirplus flag to directory context
> 
> For a concurrent 'ls -l' workload, processes can stack up in nfs_readdir()
> both waiting on the next page and taking turns filling the next page from
> XDR, but only one of them will have desc->plus set because setting it
> clears the flag on the directory.  If a page is filled by a process that
> doesn't have desc->plus set then the next pass through lookup() will cause
> it to dump the entire page cache with nfs_force_use_readdirplus().  Then
> the next readdir starts all over filling the pagecache.  Forward progress
> happens, but only after many steps back re-filling the pagecache.
> 
> Fix this by moving the flag to use readdirplus to each open directory
> context.
> 
> Suggested-by: Trond Myklebust <trondmy@primarydata.com>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
> fs/nfs/dir.c           | 39 ++++++++++++++++++++++++---------------
> include/linux/nfs_fs.h |  1 +
> 2 files changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 7483722162fa..818172606fed 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
> 		ctx->dir_cookie = 0;
> 		ctx->dup_cookie = 0;
> 		ctx->cred = get_rpccred(cred);
> +		ctx->use_readdir_plus = 0;
> 		spin_lock(&dir->i_lock);
> 		list_add(&ctx->list, &nfsi->open_files);
> 		spin_unlock(&dir->i_lock);
> @@ -443,17 +444,35 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
> }
> 
> static
> -bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> +bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
> +			struct nfs_open_dir_context *dir_ctx)
> {
> 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
> 		return false;
> -	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
> +	if (xchg(&dir_ctx->use_readdir_plus, 0))
> 		return true;
> 	if (ctx->pos == 0)
> 		return true;
> 	return false;
> }
> 
> +static
> +void nfs_set_readdirplus(struct inode *dir, int force)
> +{
> +	struct nfs_inode *nfsi = NFS_I(dir);
> +	struct nfs_open_dir_context *ctx;
> +
> +	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> +	    !list_empty(&nfsi->open_files)) {
> +		spin_lock(&dir->i_lock);
> +		list_for_each_entry(ctx, &nfsi->open_files, list)
> +			ctx->use_readdir_plus = 1;
> +		spin_unlock(&dir->i_lock);
> +		if (force)
> +			invalidate_mapping_pages(dir->i_mapping, 0, -1);
> +	}
> +}
> +
> /*
>  * This function is called by the lookup and getattr code to request the
>  * use of readdirplus to accelerate any future lookups in the same
> @@ -461,11 +480,7 @@ bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>  */
> void nfs_advise_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files))
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> +	nfs_set_readdirplus(dir, 0);
> }
> 
> /*
> @@ -478,13 +493,7 @@ void nfs_advise_use_readdirplus(struct inode *dir)
>  */
> void nfs_force_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files)) {
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> -		invalidate_mapping_pages(dir->i_mapping, 0, -1);
> -	}
> +	nfs_set_readdirplus(dir, 1);
> }
> 
> static
> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> 	desc->ctx = ctx;
> 	desc->dir_cookie = &dir_ctx->dir_cookie;
> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;

This fixes desc->plus at the beginning of the readdir() call. Perhaps we
should instead check the value of ctx->use_readdir_plus in the call to
nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the very
end of nfs_readdir()?

> 
> 	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
> 		res = nfs_revalidate_mapping(inode, file->f_mapping);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index cb631973839a..fe06c1dd1737 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -89,6 +89,7 @@ struct nfs_open_dir_context {
> 	__u64 dir_cookie;
> 	__u64 dup_cookie;
> 	signed char duped;
> +	int use_readdir_plus;
> };
> 
> /*
> -- 
> 2.5.5
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 22:41       ` Trond Myklebust
@ 2016-12-07 22:55         ` Benjamin Coddington
  2016-12-07 22:59           ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 22:55 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 17:41, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>> static
>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>> 	desc->ctx = ctx;
>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>
> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
> should instead check the value of ctx->use_readdir_plus in the call to
> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
> very end of nfs_readdir()?

I don't understand the functional difference.  Is this just a preference?

There must be something else happening.. maybe the dcache is getting pruned
under pressure, which causes a miss, and then the process starts over?

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 22:55         ` Benjamin Coddington
@ 2016-12-07 22:59           ` Trond Myklebust
  2016-12-07 23:10             ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2016-12-07 22:59 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Linux NFS Mailing List

> On Dec 7, 2016, at 17:55, Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
> 
>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>>> static
>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>> 	desc->ctx = ctx;
>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>> 
>> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
>> should instead check the value of ctx->use_readdir_plus in the call to
>> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
>> very end of nfs_readdir()?
> 
> I don't understand the functional difference.  Is this just a preference?

No. The functional difference is that we take into account the fact that a
process may be in the readdir() code while a GETATTR or LOOKUP from a
second process is triggering “use readdirplus” feedback.

> 
> There must be something else happening.. dcache is getting under pressure
> pruned maybe, that causes a miss and then the process starts over?
> 
> Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 22:59           ` Trond Myklebust
@ 2016-12-07 23:10             ` Benjamin Coddington
  2016-12-08 14:18               ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-07 23:10 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 17:59, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:55, Benjamin Coddington <bcodding@redhat.com> wrote:
>>
>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>
>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>>>> static
>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>> 	desc->ctx = ctx;
>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>
>>> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
>>> should instead check the value of ctx->use_readdir_plus in the call to
>>> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
>>> very end of nfs_readdir()?
>>
>> I don't understand the functional difference.  Is this just a preference?
>
> No. The functional difference is that we take into account the fact that a
> process may be in the readdir() code while a GETATTR or LOOKUP from a
> second process is triggering “use readdirplus” feedback.

This should only matter if the concurrent processes have different buffer
sizes or start at different offsets -- which shouldn't happen with plain old
'ls -l'.

.. or maybe I'm wrong if, hmm..  if acdirmin ran out (maybe?).. or if we mix
'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll try it.

Thanks for your suggestions.
Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-07 23:10             ` Benjamin Coddington
@ 2016-12-08 14:18               ` Benjamin Coddington
  2016-12-08 16:13                 ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-08 14:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List


On 7 Dec 2016, at 18:10, Benjamin Coddington wrote:

> On 7 Dec 2016, at 17:59, Trond Myklebust wrote:
>
>>> On Dec 7, 2016, at 17:55, Benjamin Coddington <bcodding@redhat.com> wrote:
>>>
>>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>>
>>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>>>>> static
>>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>>> 	desc->ctx = ctx;
>>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>>
>>>> This fixes desc->plus at the beginning of the readdir() call. Perhaps we
>>>> should instead check the value of ctx->use_readdir_plus in the call to
>>>> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
>>>> very end of nfs_readdir()?
>>>
>>> I don't understand the functional difference.  Is this just a preference?
>>
>> No. The functional difference is that we take into account the fact that a
>> process may be in the readdir() code while a GETATTR or LOOKUP from a
>> second process is triggering “use readdirplus” feedback.
>
> This should only matter if the concurrent processes have different buffer
> sizes or start at different offsets -- which shouldn't happen with plain
> old 'ls -l'.
>
> .. or maybe I'm wrong if, hmm..  if acdirmin ran out (maybe?).. or if we
> mix 'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll
> try it.

This doesn't help.

The issue is that anything more than a couple of processes causes contention
on the directory's i_lock.  The i_lock is taken once per entry, per process.
The inode flags and serialization worked far better.

If we add another inode flag, then we can add a DOING_RDPLUS.  A process
entering nfs_readdir() that sees ADVISE and not DOING clears it, sets DOING,
and remembers to clear DOING on exit of nfs_readdir().  Any process that
sees DOING uses plus.

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-08 14:18               ` Benjamin Coddington
@ 2016-12-08 16:13                 ` Benjamin Coddington
  2016-12-08 21:48                   ` Benjamin Coddington
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-08 16:13 UTC (permalink / raw)
  To: Trond Myklebust, Trond Myklebust; +Cc: linux-nfs

From: "Benjamin Coddington" <bcodding@redhat.com>

Ehh.. I think it's kinda ugly, but this is what I came up with.

It works well, though, and handles everything I throw at it.  It's not as
simple as Christoph's suggestion to just go back to .iterate, which would
solve this in a simpler way.

Ben


8<---------------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Concurrent `ls` takes out the thrash
  2016-12-08 16:13                 ` Benjamin Coddington
@ 2016-12-08 21:48                   ` Benjamin Coddington
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Coddington @ 2016-12-08 21:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

> Ehh.. I think it's kinda ugly, but this is what I came up with.
>
> It works well, though, and handles everything I throw at it.  It's not as
> simple as Christoph's suggestion to just go back to .iterate, which would
> solve this in a simpler way.

After much testing, I can't find any way this is faster than using the old
.iterate serialized readdir.  Additionally, at around 1M entries, this way
frequently needs to oom-kill things.  Using .iterate, that doesn't happen,
and it is much faster at that scale.

Maybe my feeble brain is unable to think of the right workload to make
.iterate_shared preferable for NFS.  Suggestions welcome.

Otherwise, I think serialized is the way to go.

Ben

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-12-08 21:48 UTC | newest]

Thread overview: 15+ messages
2016-12-07 13:28 Concurrent `ls` takes out the thrash Benjamin Coddington
2016-12-07 13:37 ` [PATCH] NFS: Serialize nfs_readdir() Benjamin Coddington
2016-12-07 16:30   ` Christoph Hellwig
2016-12-07 19:40     ` Benjamin Coddington
2016-12-07 17:01   ` kbuild test robot
2016-12-07 15:46 ` Concurrent `ls` takes out the thrash Trond Myklebust
2016-12-07 19:46   ` Benjamin Coddington
2016-12-07 22:34     ` Benjamin Coddington
2016-12-07 22:41       ` Trond Myklebust
2016-12-07 22:55         ` Benjamin Coddington
2016-12-07 22:59           ` Trond Myklebust
2016-12-07 23:10             ` Benjamin Coddington
2016-12-08 14:18               ` Benjamin Coddington
2016-12-08 16:13                 ` Benjamin Coddington
2016-12-08 21:48                   ` Benjamin Coddington
