From: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
To: David Wysochanski <dwysocha@redhat.com>
Cc: trondmy <trondmy@hammerspace.com>,
Anna Schumaker <anna.schumaker@netapp.com>,
linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories
Date: Mon, 2 Nov 2020 16:58:33 +0100 (CET)
Message-ID: <42420978.5311919.1604332713797.JavaMail.zimbra@desy.de>
In-Reply-To: <1604325011-29427-1-git-send-email-dwysocha@redhat.com>
Hi Dave,
thanks for addressing this issue!
Readdir on a large directory that is constantly changing is a pain in the b...
We have solved it by creating a snapshot of the directory on the server and
identifying that snapshot by the cookie verifier. This is probably not POSIX
compliant, but it works for us.
I am happy to run some tests.
Tigran.
----- Original Message -----
> From: "David Wysochanski" <dwysocha@redhat.com>
> To: "trondmy" <trondmy@hammerspace.com>, "Anna Schumaker" <anna.schumaker@netapp.com>
> Cc: "linux-nfs" <linux-nfs@vger.kernel.org>
> Sent: Monday, 2 November, 2020 14:50:00
> Subject: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories
> Note these patches are also on github at:
> https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir
>
> This patchset is mostly tracepoints but patch 9 fixes a performance
> problem with NFSv4.x when listing a directory that is being modified.
> After the NFSv4.x patch, for consistency, patch 11 is included which
> makes NFSv3 behave the same as NFSv4, and given the inclusion of
> patch 9 does not re-introduce the regression fixed in previous commit
> 79f687a3de9e. As described in patch 11, NFSv3 does not currently have
> the NFSv4 problem because it does not drop the pagecache when a
> large directory read is in progress.
>
> As background, currently any process that is listing a directory
> must search starting at cookie 0 at each entry to nfs_readdir(),
> regardless of the previous cookie returned by the server, regardless of
> NFS version, and regardless of whether the directory is being modified
> or cache expired. This means, for example, if a large directory listing
> requires many getdents64/nfs_readdir calls, the kernel will search
> through the pagecache with an increasing penalty as the size of the
> directory increases and the process attempts to obtain the latter
> entries of the directory. When the directory is being modified, and
> when acdirmax is exceeded, and nfs_attribute_cache_expired() returns
> true, with NFSv4.x we drop the pagecache for the directory entries,
> so the process cannot make forward progress.
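As a back-of-the-envelope illustration of that penalty (this is not part of the patchset; the linear-rescan cost model below is an assumption for illustration, using the ~4902 READDIR ops from the reproducer output):

```shell
# Rough cost model (assumed, not measured): if each of N nfs_readdir()
# entries rescans the pagecache from cookie 0, the k-th call revisits
# roughly k cached pages, so total page visits grow as N*(N+1)/2.
calls=4902   # READDIR ops observed for 500,000 files in the reproducer
total=$(( calls * (calls + 1) / 2 ))
echo "worst-case cumulative page rescans: $total"
```

With a directory small enough to list within acdirmax, N stays small and the quadratic term is invisible; it is the combination of a large N and the mid-listing cache drop that makes the cost dominate.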
>
> I investigated using a similar approach as was done with NFSv3 in
> commit 79f687a3de9e, but for NFSv4, READDIR does not return the directory
> attributes and thus _nfs4_proc_readdir does not call nfs_refresh_inode()
> unlike nfs3_proc_readdir(). With NFSv3, the calling of nfs_refresh_inode()
> is what updates nfsi->read_cache_jiffies causing nfs_attribute_cache_expired()
> to always return false, leaving the pagecache intact despite
> NFS_INO_INVALID_DATA being set. Thus it's not clear whether the NFSv3
> approach could be done for NFSv4 to achieve the same behavior as with
> NFSv3. In addition, the current NFSv3 approach leaves in place the
> aforementioned penalty, which for large directories can be substantial.
> So rather than this approach, the approach taken with NFSv4 leaves
> in place the dropping of the pagecache when acdirmax expires, and
> relies on the fact that a process can trust the last cookie returned
> by the server and continue at that point in the pagecache without
> forcing the current process to start over from cookie 0 upon the
> next entry to nfs_readdir(). As far as I could tell there is no
> POSIX requirement to start from 0 again when a directory is being
> modified during an in-progress directory read, and so the current
> behavior is only an implementation limitation.
>
> The NFSv4 performance problem can be seen by exporting a directory
> with a larger number of files such that the uncached time to list the
> directory exceeds acdirmax. I have an explicit reproducer script
> I can provide, but a simple reproducer outline is as follows:
>
> 1. First export a directory that contains enough files that the
> listing exceeds acdirmax. In my testing, the NFS server and client
> were the same system, there were 500,000 files in the directory,
> and I set acdirmin=10,acdirmax=20.
>
> 2. Mount client1 (reader) and client2 (writer). Note below I use
> the loopback IP address, and a 'wsize' parameter on the writer mount
> so we get different superblocks:
> mount -o vers=4.1 127.0.0.1:/export/bz1889478 $MNT1
> mount -o vers=4.1,wsize=65536 127.0.0.1:/export/bz1889478 $MNT2
>
> 3. Start client2 (writer):
> echo 3 > /proc/sys/vm/drop_caches
> i=500000; while true; do touch $MNT2/file$i.bin; i=$((i + 1)); sleep 1; done >
> /dev/null 2>&1 &
>
> 4. Start client1 (reader):
> while true; do time ls -1f $MNT1 | wc -l; done &
>
> The output from my reproducer script is:
> ./t0_bz1889478.sh 4.1 127.0.0.1 500000
> Mon 02 Nov 2020 07:54:22 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and
> files 500000 directory is idle
> Mon 02 Nov 2020 07:54:22 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478
> /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:22 AM EST: dropping caches with: echo 3 >
> /proc/sys/vm/drop_caches
> Mon 02 Nov 2020 07:54:24 AM EST: idle directory: skipping client2 adding files
> Mon 02 Nov 2020 07:54:24 AM EST: client1 running command: ls -lf
> /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:24 AM EST: waiting for listing to complete
> Mon 02 Nov 2020 07:54:46 AM EST: client1 pid took 22 seconds to list the
> directory of 500000 files
> Mon 02 Nov 2020 07:54:46 AM EST: client1 READDIR ops total: 4902 (before = 0
> after = 4902)
> Mon 02 Nov 2020 07:54:47 AM EST: umounting /mnt/nfs/bz1889478-mnt1
> /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:47 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and
> files 500000 directory being modified
> Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478
> /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1,wsize=65536
> 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:47 AM EST: dropping caches with: echo 3 >
> /proc/sys/vm/drop_caches
> Mon 02 Nov 2020 07:54:50 AM EST: changing directory: client2 start adding 1
> file/sec at /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:51 AM EST: client1 running command: ls -lf
> /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:51 AM EST: waiting for listing to complete
> Mon 02 Nov 2020 07:55:12 AM EST: client1 pid took 21 seconds to list the
> directory of 500000 files
> Mon 02 Nov 2020 07:55:12 AM EST: client1 READDIR ops total: 4902 (before = 0
> after = 4902)
> ./t0_bz1889478.sh: line 124: 5973 Killed while true; do
> echo $i; touch $NFS_CLIENT_MOUNTPOINT2/file$i.bin; i=$((i + 1)); sleep 1;
> done > /dev/null 2>&1
> Mon 02 Nov 2020 07:55:20 AM EST: umounting /mnt/nfs/bz1889478-mnt1
> /mnt/nfs/bz1889478-mnt2
> PASSED TEST ./t0_bz1889478.sh on kernel 5.9.0-nfs-readdir+ with NFS vers=4.1
>
> For diagnostics and verification, a tcpdump can be used, or the
> READDIR op counts and timings can be compared as in the
> reproducer, but the included tracepoints can also be used. For
> the tracepoints, before step #2 above use the below trace-cmd
> invocation to trace the listing and check whether the problem
> occurs, evidenced either by the presence of nfs_invalidate_mapping*
> trace events or by multiple nfs_readdir_enter calls with
> "cookie=0x00000000":
> trace-cmd start -e nfs:nfs_readdir_enter -e nfs4:nfs4_readdir -e
> nfs:nfs_readdir_exit -e nfs:nfs_invalidate_mapping_*
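Once tracing is stopped, the saved trace can be grepped for restarts. The sample lines below are synthetic (the exact field layout of the trace events is an assumption, not copied from the patches), but they show the kind of post-processing that works:

```shell
# Count nfs_readdir_enter events that restarted at cookie 0; more than
# one per listing suggests the pagecache was dropped mid-listing.
# NOTE: sample-trace.txt holds synthetic example lines, not real output;
# in practice it would come from e.g. 'trace-cmd show > sample-trace.txt'.
cat > sample-trace.txt <<'EOF'
ls-1234  [000]  100.000001: nfs_readdir_enter: cookie=0x00000000
ls-1234  [000]  100.100002: nfs_readdir_enter: cookie=0x000001f4
ls-1234  [000]  100.200003: nfs_readdir_enter: cookie=0x00000000
EOF
grep -c 'nfs_readdir_enter.*cookie=0x00000000' sample-trace.txt
```

For the synthetic data above the count is 2: one legitimate start of the listing plus one restart.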
>
>
> Dave Wysochanski (11):
> NFSv4: Improve nfs4_readdir tracepoint by adding additional fields
> NFS: Replace dfprintk statements with trace events in nfs_readdir
> NFS: Move nfs_readdir_descriptor_t into internal header
> NFS: Add tracepoints for functions involving nfs_readdir_descriptor_t
> NFS: Add tracepoints for opendir, closedir, fsync_dir and llseek_dir
> NFS: Add tracepoints for nfs_readdir_xdr_filler enter and exit
> NFS: Add tracepoint to entry and exit of nfs_do_filldir
> NFS: Replace LOOKUPCACHE dfprintk statements with tracepoints
> NFS: Improve performance of listing directories being modified
> NFS: Add page_index to nfs_readdir enter and exit tracepoints
> NFS: Bring back nfs_dir_mapping_need_revalidate() in nfs_readdir()
>
> fs/nfs/dir.c | 101 +++++++++---------
> fs/nfs/internal.h | 18 ++++
> fs/nfs/nfs3xdr.c | 2 +-
> fs/nfs/nfs4proc.c | 2 +-
> fs/nfs/nfs4trace.h | 44 +++++++-
> fs/nfs/nfstrace.h | 277 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/nfs_fs.h | 1 +
> 7 files changed, 394 insertions(+), 51 deletions(-)
>
> --
> 1.8.3.1