* Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-07 13:28 UTC
To: linux-nfs list

I was asked to figure out why the listing of very large directories was
slow.  More specifically, why concurrently listing the same large directory
is /very/ slow.  It seems that sometimes a user's reaction to waiting for
'ls' to complete is to start a few more.. and then their machine takes a
very long time to complete that work.

I can reproduce that finding.  As an example:

time ls -fl /dir/with/200000/entries/ >/dev/null

real    0m10.766s
user    0m0.716s
sys     0m0.827s

But..

for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done

Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.

The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
both waiting on the next page and taking turns filling the next page with
xdr, but only one of them will have desc->plus set, because setting it
clears the flag on the directory.  So if a page is filled by a process that
doesn't have desc->plus set, the next pass through lookup() dumps the
entire page cache with nfs_force_use_readdirplus().  Then the next readdir
starts all over filling the pagecache.  Forward progress happens, but only
after many steps back re-filling the pagecache.

To me the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that with
nfsi->rwsem.  With that, the above parallel 'ls' takes 12 seconds for each
'ls' to complete.

This only works because concurrent 'ls' invocations use a consistent buffer
size, so a waiting nfs_readdir() that started in the same place on an
unmodified directory should always hit the cache after waiting.
Serializing nfs_readdir() will not solve this problem for concurrent
callers with differing buffer sizes, or callers starting at different
offsets, since there's a good chance the waiting readdir() will not see the
readdirplus flag when it resumes and so will not prime the dcache.

While I think it's an OK fix, it feels bad to serialize.  At the same time,
nfs_readdir() is already serialized on the pagecache when concurrent
callers need to go to the server.  There might be other problems I haven't
thought about.  Maybe there's another way to fix this, or maybe we can just
say "Don't do ls more than once, you impatient bastards!"

Ben

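[For reference, the READDIRPLUS selection logic at issue looked like this
at the time, reconstructed from the removed lines of the diff quoted later
in this thread.  Only the caller that wins the test_and_clear_bit() race
gets READDIRPLUS; every other concurrent caller falls back to plain READDIR
and fills the pagecache without the attributes the dcache needs:

static
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
		return false;
	/* Only one of N concurrent callers can win this race: the bit
	 * is cleared as it is read, so the others see it unset. */
	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
		return true;
	/* A fresh listing from offset 0 always starts with READDIRPLUS. */
	if (ctx->pos == 0)
		return true;
	return false;
}
]
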
* [PATCH] NFS: Serialize nfs_readdir()
From: Benjamin Coddington @ 2016-12-07 13:37 UTC
To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Optimizations to NFS to choose between READDIR and READDIRPLUS don't expect
concurrent users of nfs_readdir(), and can cause the pagecache to
repeatedly be invalidated unnecessarily for this case.  Fix this by
serializing nfs_readdir() on the directory's rwsem.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 53e6d6a4bc9c..8321252a4c8d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -900,6 +900,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = d_inode(dentry);
+	struct nfs_inode *nfsi = NFS_I(inode);
 	nfs_readdir_descriptor_t my_desc, *desc = &my_desc;
 	struct nfs_open_dir_context *dir_ctx = file->private_data;

@@ -917,6 +918,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	 */
 	memset(desc, 0, sizeof(*desc));

+	down_write(&nfsi->rwsem);
 	desc->file = file;
 	desc->ctx = ctx;
 	desc->dir_cookie = &dir_ctx->dir_cookie;
@@ -958,6 +960,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		break;
 	} while (!desc->eof);
 out:
+	up_write(&nfsi->rwsem);
 	if (res > 0)
 		res = 0;
 	dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
-- 
2.5.5

* Re: [PATCH] NFS: Serialize nfs_readdir()
From: Christoph Hellwig @ 2016-12-07 16:30 UTC
To: Benjamin Coddington; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs

On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
> Optimizations to NFS to choose between READDIR and READDIRPLUS don't
> expect concurrent users of nfs_readdir(), and can cause the pagecache to
> repeatedly be invalidated unnecessarily for this case.  Fix this by
> serializing nfs_readdir() on the directory's rwsem.

Just implement .iterate instead of .iterate_shared and you'll get the
serialization for free.  While doing that please add a comment on why it's
serialized.

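[A minimal sketch of Christoph's suggestion, assuming the v4.9 layout of
nfs_dir_operations (field list approximated from memory): with .iterate,
the VFS holds the directory's inode lock exclusively across each call,
which is where the serialization comes for free:

const struct file_operations nfs_dir_operations = {
	.llseek		= nfs_llseek_dir,
	.read		= generic_read_dir,
	.iterate	= nfs_readdir,	/* was .iterate_shared; the VFS now
					 * takes the directory inode lock
					 * exclusively around each call */
	.open		= nfs_opendir,
	.release	= nfs_closedir,
	.fsync		= nfs_fsync_dir,
};
]
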
* Re: [PATCH] NFS: Serialize nfs_readdir()
From: Benjamin Coddington @ 2016-12-07 19:40 UTC
To: Christoph Hellwig; +Cc: Trond Myklebust, Anna Schumaker, linux-nfs

On 7 Dec 2016, at 11:30, Christoph Hellwig wrote:

> On Wed, Dec 07, 2016 at 08:37:02AM -0500, Benjamin Coddington wrote:
>> Optimizations to NFS to choose between READDIR and READDIRPLUS don't
>> expect concurrent users of nfs_readdir(), and can cause the pagecache
>> to repeatedly be invalidated unnecessarily for this case.  Fix this by
>> serializing nfs_readdir() on the directory's rwsem.
>
> Just implement .iterate instead of .iterate_shared and you'll get the
> serialization for free.  While doing that please add a comment on why
> it's serialized.

I had it in my head that Al was trying to get rid of .iterate, otherwise
yes, this would be more sensible.  I think it'll still benefit from
.iterate_shared in the regular case (the case where we're not trying to use
READDIRPLUS), so I'll try out Trond's suggestion to fix it up before
chasing this down.

Ben

* Re: [PATCH] NFS: Serialize nfs_readdir()
From: kbuild test robot @ 2016-12-07 17:01 UTC
To: Benjamin Coddington; +Cc: kbuild-all, Trond Myklebust, Anna Schumaker, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

Hi Benjamin,

[auto build test ERROR on nfs/linux-next]
[also build test ERROR on v4.9-rc8 next-20161207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Benjamin-Coddington/NFS-Serialize-nfs_readdir/20161208-001039
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

All errors (new ones prefixed by >>):

   fs/nfs/dir.c: In function 'nfs_readdir':
>> fs/nfs/dir.c:921:18: error: 'struct nfs_inode' has no member named 'rwsem'
      down_write(&nfsi->rwsem);
                      ^
   fs/nfs/dir.c:963:16: error: 'struct nfs_inode' has no member named 'rwsem'
      up_write(&nfsi->rwsem);
                    ^

vim +921 fs/nfs/dir.c

   915		 * *desc->dir_cookie has the cookie for the next entry. We have
   916		 * to either find the entry with the appropriate number or
   917		 * revalidate the cookie.
   918		 */
   919		memset(desc, 0, sizeof(*desc));
   920	
 > 921		down_write(&nfsi->rwsem);
   922		desc->file = file;
   923		desc->ctx = ctx;
   924		desc->dir_cookie = &dir_ctx->dir_cookie;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 10104 bytes --]

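[The missing member is a config issue: in the v4.9-era
include/linux/nfs_fs.h, if memory serves, nfsi->rwsem is declared only
inside the NFSv4 section of struct nfs_inode, and xtensa's common_defconfig
evidently builds without CONFIG_NFS_V4.  Treat this layout as an
approximation from memory:

struct nfs_inode {
	/* ... fields visible to all NFS versions ... */
#if IS_ENABLED(CONFIG_NFS_V4)
	struct nfs4_cached_acl	*nfs4_acl;
	/* NFSv4 state */
	struct list_head	open_states;
	struct nfs_delegation __rcu *delegation;
	struct rw_semaphore	rwsem;	/* absent when NFSv4 is not built */
	/* pNFS layout information */
	struct pnfs_layout_hdr *layout;
#endif /* CONFIG_NFS_V4 */
	/* ... */
};
]
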
* Re: Concurrent `ls` takes out the thrash
From: Trond Myklebust @ 2016-12-07 15:46 UTC
To: Benjamin Coddington; +Cc: Linux NFS Mailing List

> On Dec 7, 2016, at 08:28, Benjamin Coddington <bcodding@redhat.com> wrote:
>
> I was asked to figure out why the listing of very large directories was
> slow.  More specifically, why concurrently listing the same large
> directory is /very/ slow.  It seems that sometimes a user's reaction to
> waiting for 'ls' to complete is to start a few more.. and then their
> machine takes a very long time to complete that work.
>
> I can reproduce that finding.  As an example:
>
> time ls -fl /dir/with/200000/entries/ >/dev/null
>
> real    0m10.766s
> user    0m0.716s
> sys     0m0.827s
>
> But..
>
> for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done
>
> Each of these ^^ 'ls' commands will take 4 to 5 minutes to complete.
>
> The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
> both waiting on the next page and taking turns filling the next page with
> xdr, but only one of them will have desc->plus set, because setting it
> clears the flag on the directory.  So if a page is filled by a process
> that doesn't have desc->plus set, the next pass through lookup() dumps
> the entire page cache with nfs_force_use_readdirplus().  Then the next
> readdir starts all over filling the pagecache.  Forward progress happens,
> but only after many steps back re-filling the pagecache.

Yes, the readdir code was written well before Al's patches to parallelise
the VFS operations, and a lot of it did rely on the inode->i_mutex being
set on the directory by the VFS layer.

How about the following suggestion: instead of setting a flag on the inode,
we iterate through the entries in &nfsi->open_files, and set a flag on the
struct nfs_open_dir_context that the readdir processes can copy into
desc->plus.  Does that help with your workload?

* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-07 19:46 UTC
To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 10:46, Trond Myklebust wrote:

>> On Dec 7, 2016, at 08:28, Benjamin Coddington <bcodding@redhat.com>
>> wrote:
>>
>> The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
>> both waiting on the next page and taking turns filling the next page
>> with xdr, but only one of them will have desc->plus set, because setting
>> it clears the flag on the directory.  So if a page is filled by a
>> process that doesn't have desc->plus set, the next pass through lookup()
>> dumps the entire page cache with nfs_force_use_readdirplus().  Then the
>> next readdir starts all over filling the pagecache.  Forward progress
>> happens, but only after many steps back re-filling the pagecache.
>
> Yes, the readdir code was written well before Al's patches to parallelise
> the VFS operations, and a lot of it did rely on the inode->i_mutex being
> set on the directory by the VFS layer.
>
> How about the following suggestion: instead of setting a flag on the
> inode, we iterate through the entries in &nfsi->open_files, and set a
> flag on the struct nfs_open_dir_context that the readdir processes can
> copy into desc->plus.  Does that help with your workload?

That should work.. I guess I'll hack it up and present it for dissection.
Thanks!

Ben

* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-07 22:34 UTC
To: Trond Myklebust; +Cc: linux-nfs

From: "Benjamin Coddington" <bcodding@redhat.com>

So, I've got the following patch, and it seems to work fine for the
original workload.  However, if I use ~20 procs started 2 seconds apart, I
can still sometimes get into the state where the machine takes a very long
time (5 - 8 minutes).  I wonder if I am getting into a state where vmscan
is reclaiming pages while I'm trying to fill them up.  So.. I'll do a bit
more debugging and re-post this if I feel like it is still the right
approach.

Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
could have added an unsigned long flags and used bits for the duped stuff
as well, but it was uglier that way, so I left it.  I like how duped
carries -1, 0, and >1 information without having flag defines cluttering
everywhere.

Ben

8<------------------------------------------------------------------------

* Re: Concurrent `ls` takes out the thrash
From: Trond Myklebust @ 2016-12-07 22:41 UTC
To: Benjamin Coddington; +Cc: Linux NFS Mailing List

> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>
> From: "Benjamin Coddington" <bcodding@redhat.com>
>
> So, I've got the following patch, and it seems to work fine for the
> original workload.  However, if I use ~20 procs started 2 seconds apart,
> I can still sometimes get into the state where the machine takes a very
> long time (5 - 8 minutes).  I wonder if I am getting into a state where
> vmscan is reclaiming pages while I'm trying to fill them up.  So.. I'll
> do a bit more debugging and re-post this if I feel like it is still the
> right approach.
>
> Adding an int to nfs_open_dir_context puts it at 60 bytes here.  Probably
> could have added an unsigned long flags and used bits for the duped stuff
> as well, but it was uglier that way, so I left it.  I like how duped
> carries -1, 0, and >1 information without having flag defines cluttering
> everywhere.
>
> Ben
>
> 8<------------------------------------------------------------------------
>
> From 5b3e747a984ec966e8c742d8f4fe4b08e1c93acd Mon Sep 17 00:00:00 2001
> Message-Id: <5b3e747a984ec966e8c742d8f4fe4b08e1c93acd.1481149380.git.bcodding@redhat.com>
> From: Benjamin Coddington <bcodding@redhat.com>
> Date: Wed, 7 Dec 2016 16:23:30 -0500
> Subject: [PATCH] NFS: Move readdirplus flag to directory context
>
> For a concurrent 'ls -l' workload, processes can stack up in
> nfs_readdir(), both waiting on the next page and taking turns filling the
> next page from XDR, but only one of them will have desc->plus set because
> setting it clears the flag on the directory.  If a page is filled by a
> process that doesn't have desc->plus set then the next pass through
> lookup() will cause it to dump the entire page cache with
> nfs_force_use_readdirplus().  Then the next readdir starts all over
> filling the pagecache.  Forward progress happens, but only after many
> steps back re-filling the pagecache.
>
> Fix this by moving the flag to use readdirplus to each open directory
> context.
>
> Suggested-by: Trond Myklebust <trondmy@primarydata.com>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
> fs/nfs/dir.c           | 39 ++++++++++++++++++++++++---------------
> include/linux/nfs_fs.h |  1 +
> 2 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 7483722162fa..818172606fed 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
> 		ctx->dir_cookie = 0;
> 		ctx->dup_cookie = 0;
> 		ctx->cred = get_rpccred(cred);
> +		ctx->use_readdir_plus = 0;
> 		spin_lock(&dir->i_lock);
> 		list_add(&ctx->list, &nfsi->open_files);
> 		spin_unlock(&dir->i_lock);
> @@ -443,17 +444,35 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
> }
>
> static
> -bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> +bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
> +			struct nfs_open_dir_context *dir_ctx)
> {
> 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
> 		return false;
> -	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
> +	if (xchg(&dir_ctx->use_readdir_plus, 0))
> 		return true;
> 	if (ctx->pos == 0)
> 		return true;
> 	return false;
> }
>
> +static
> +void nfs_set_readdirplus(struct inode *dir, int force)
> +{
> +	struct nfs_inode *nfsi = NFS_I(dir);
> +	struct nfs_open_dir_context *ctx;
> +
> +	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> +	    !list_empty(&nfsi->open_files)) {
> +		spin_lock(&dir->i_lock);
> +		list_for_each_entry(ctx, &nfsi->open_files, list)
> +			ctx->use_readdir_plus = 1;
> +		spin_unlock(&dir->i_lock);
> +		if (force)
> +			invalidate_mapping_pages(dir->i_mapping, 0, -1);
> +	}
> +}
> +
> /*
>  * This function is called by the lookup and getattr code to request the
>  * use of readdirplus to accelerate any future lookups in the same
> @@ -461,11 +480,7 @@ bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>  */
> void nfs_advise_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files))
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> +	nfs_set_readdirplus(dir, 0);
> }
>
> /*
> @@ -478,13 +493,7 @@ void nfs_advise_use_readdirplus(struct inode *dir)
>  */
> void nfs_force_use_readdirplus(struct inode *dir)
> {
> -	struct nfs_inode *nfsi = NFS_I(dir);
> -
> -	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
> -	    !list_empty(&nfsi->open_files)) {
> -		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
> -		invalidate_mapping_pages(dir->i_mapping, 0, -1);
> -	}
> +	nfs_set_readdirplus(dir, 1);
> }
>
> static
> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> 	desc->ctx = ctx;
> 	desc->dir_cookie = &dir_ctx->dir_cookie;
> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;

This fixes desc->plus at the beginning of the readdir() call.  Perhaps we
should instead check the value of ctx->use_readdir_plus in the call to
nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the very
end of nfs_readdir()?
>
> 	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
> 		res = nfs_revalidate_mapping(inode, file->f_mapping);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index cb631973839a..fe06c1dd1737 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -89,6 +89,7 @@ struct nfs_open_dir_context {
> 	__u64 dir_cookie;
> 	__u64 dup_cookie;
> 	signed char duped;
> +	int use_readdir_plus;
> };
>
> /*
> -- 
> 2.5.5

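[A rough sketch of the alternative Trond describes, with approximated (not
exact) v4.9 signatures: the plus decision moves from nfs_readdir() entry
into the wire-fill path, so advice arriving from a concurrent LOOKUP or
GETATTR is still picked up by a process that is already inside
nfs_readdir():

static
int nfs_readdir_xdr_filler(struct page **pages,
			nfs_readdir_descriptor_t *desc,
			struct nfs_entry *entry, struct file *file,
			struct inode *inode)
{
	struct nfs_open_dir_context *dir_ctx = file->private_data;

	/* Sample (without clearing) the per-open-context flag on every
	 * trip to the wire, so feedback that arrives mid-readdir() is
	 * acted on by the very next RPC. */
	desc->plus = READ_ONCE(dir_ctx->use_readdir_plus) ||
		     desc->ctx->pos == 0;

	return NFS_PROTO(inode)->readdir(file_dentry(file),
					 nfs_file_cred(file),
					 entry->cookie, pages,
					 NFS_SERVER(inode)->dtsize,
					 desc->plus);
}

The flag would then be reset (dir_ctx->use_readdir_plus = 0) only at the
tail of nfs_readdir(), after the do/while loop, so the advice is consumed
once per completed listing rather than once at entry to the call.]
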
* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-07 22:55 UTC
To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 17:41, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>> static
>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>> 	desc->ctx = ctx;
>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>
> This fixes desc->plus at the beginning of the readdir() call.  Perhaps we
> should instead check the value of ctx->use_readdir_plus in the call to
> nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at the
> very end of nfs_readdir()?

I don't understand the functional difference.  Is this just a preference?

There must be something else happening.. dcache is getting pruned under
pressure maybe, and that causes a miss and then the process starts over?

Ben

* Re: Concurrent `ls` takes out the thrash
From: Trond Myklebust @ 2016-12-07 22:59 UTC
To: Benjamin Coddington; +Cc: Linux NFS Mailing List

> On Dec 7, 2016, at 17:55, Benjamin Coddington <bcodding@redhat.com> wrote:
>
> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>
>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>>> static
>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>> 	desc->ctx = ctx;
>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>
>> This fixes desc->plus at the beginning of the readdir() call.  Perhaps
>> we should instead check the value of ctx->use_readdir_plus in the call
>> to nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at
>> the very end of nfs_readdir()?
>
> I don't understand the functional difference.  Is this just a preference?

No. The functional difference is that we take into account the fact that a
process may be in the readdir() code while a GETATTR or LOOKUP from a
second process is triggering "use readdirplus" feedback.

>
> There must be something else happening.. dcache is getting pruned under
> pressure maybe, and that causes a miss and then the process starts over?
>
> Ben

* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-07 23:10 UTC
To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 17:59, Trond Myklebust wrote:

>> On Dec 7, 2016, at 17:55, Benjamin Coddington <bcodding@redhat.com> wrote:
>>
>> On 7 Dec 2016, at 17:41, Trond Myklebust wrote:
>>
>>>> On Dec 7, 2016, at 17:34, Benjamin Coddington <bcodding@redhat.com> wrote:
>>>> static
>>>> @@ -921,7 +930,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>>>> 	desc->ctx = ctx;
>>>> 	desc->dir_cookie = &dir_ctx->dir_cookie;
>>>> 	desc->decode = NFS_PROTO(inode)->decode_dirent;
>>>> -	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
>>>> +	desc->plus = nfs_use_readdirplus(inode, ctx, dir_ctx) ? 1 : 0;
>>>
>>> This fixes desc->plus at the beginning of the readdir() call.  Perhaps
>>> we should instead check the value of ctx->use_readdir_plus in the call
>>> to nfs_readdir_xdr_filler(), and just reset ctx->use_readdir_plus at
>>> the very end of nfs_readdir()?
>>
>> I don't understand the functional difference.  Is this just a
>> preference?
>
> No. The functional difference is that we take into account the fact that
> a process may be in the readdir() code while a GETATTR or LOOKUP from a
> second process is triggering "use readdirplus" feedback.

This should only matter if the concurrent processes have different buffer
sizes or start at different offsets -- which shouldn't happen with plain
old 'ls -l'.

.. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we mix
'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll try
it.

Thanks for your suggestions.

Ben

* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-08 14:18 UTC
To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 7 Dec 2016, at 18:10, Benjamin Coddington wrote:

> On 7 Dec 2016, at 17:59, Trond Myklebust wrote:
>
>> No. The functional difference is that we take into account the fact
>> that a process may be in the readdir() code while a GETATTR or LOOKUP
>> from a second process is triggering "use readdirplus" feedback.
>
> This should only matter if the concurrent processes have different
> buffer sizes or start at different offsets -- which shouldn't happen
> with plain old 'ls -l'.
>
> .. or maybe I'm wrong if, hmm.. if acdirmin ran out (maybe?).. or if we
> mix 'ls -l' and 'ls'.. or if we have pages getting reclaimed.. OK.  I'll
> try it.

This doesn't help.  The issue is that anything more than a couple of
processes causes contention on the directory's i_lock.  The i_lock is
taken once per entry per process, so entries x processes times in total.
The inode flags and serialization worked far better.

If we add another inode flag, then we can add a DOING_RDPLUS.  A process
entering nfs_readdir() that sees ADVISE and not DOING clears ADVISE, sets
DOING, and remembers to clear DOING on exit from nfs_readdir().  Any
process that sees DOING uses plus.

Ben

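[A rough sketch of the two-flag scheme Ben describes; NFS_INO_DOING_RDPLUS
is a hypothetical flag name that does not exist in the tree, and the helper
shape is only an approximation:

static
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
			bool *clear_doing_on_exit)
{
	struct nfs_inode *nfsi = NFS_I(dir);

	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
		return false;
	/* The first process to see the advice claims it and marks a
	 * plus fill in progress; that process must clear DOING when
	 * its nfs_readdir() call exits. */
	if (!test_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags) &&
	    test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags)) {
		set_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags);
		*clear_doing_on_exit = true;
		return true;
	}
	/* Everyone else piggybacks on the in-flight plus fill instead
	 * of later invalidating the pages a non-plus reader filled. */
	if (test_bit(NFS_INO_DOING_RDPLUS, &nfsi->flags))
		return true;
	return ctx->pos == 0;
}
]
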
* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-08 16:13 UTC
To: Trond Myklebust; +Cc: linux-nfs

From: "Benjamin Coddington" <bcodding@redhat.com>

Ehh.. I think it's kinda ugly, but this is what I came up with.

It works well, though, and handles everything I throw at it.  It's not as
simple as Christoph's suggestion to just go back to .iterate, which would
solve this in a simpler way.

Ben

8<---------------------------------------------------------------------------------------

* Re: Concurrent `ls` takes out the thrash
From: Benjamin Coddington @ 2016-12-08 21:48 UTC
To: Trond Myklebust; +Cc: linux-nfs

> Ehh.. I think it's kinda ugly, but this is what I came up with.
>
> It works well, though, and handles everything I throw at it.  It's not
> as simple as Christoph's suggestion to just go back to .iterate, which
> would solve this in a simpler way.

After much testing, I can't find any way this is faster than using the old
.iterate serialized readdir.  Additionally, at around 1M entries, this way
frequently needs to oom-kill things.  Using .iterate, that doesn't happen,
and it is much faster at that scale.

Maybe my feeble brain is unable to think of the right workload to make
.iterate_shared preferable for NFS.  Suggestions welcome.  Otherwise, I
think serialized is the way to go.

Ben
